
Adversarial Attacks: The hidden vulnerability of A.I. chatbots


The rapidly advancing field of artificial intelligence (A.I.) — specifically chatbots like ChatGPT and Google’s Bard — faces a significant challenge in the form of adversarial attacks. 


A recent study by researchers at Carnegie Mellon University (CMU) has shed light on a vulnerability that allows these A.I. systems to be manipulated into generating undesirable or prohibited outputs. This revelation presents a substantial hurdle for the deployment and security of advanced A.I. systems.

The vulnerability and its implications

The CMU researchers have developed and demonstrated a type of exploit known as an adversarial attack. The technique involves subtly tweaking the input prompt given to a chatbot, nudging it toward generating outputs that are normally off-limits.


By adding a particular string of information to the end of harmful prompts, the researchers could trick A.I. chatbots into producing forbidden outputs. The attack method has proven effective on several popular commercial chatbots, including ChatGPT, Google’s Bard, and Anthropic’s Claude.
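
The idea of searching for a suffix that changes a system's decision can be illustrated with a deliberately harmless toy. The Python sketch below is not the researchers' method, which this article does not describe in detail, and it does not touch any real chatbot. Everything in it is a hypothetical stand-in invented for illustration: the scoring function `refusal_score`, the blocked word, the threshold, and the made-up "quirk" that exclamation marks lower the score. It only shows the general shape of a greedy search that appends characters until a toy filter stops refusing.

```python
import string

# Toy stand-in for a safety filter. This is NOT a real model: the blocked
# word, its weight, and the "!" quirk below are all invented for illustration.
BLOCKED_WORDS = {"forbidden": 3.0}
ALPHABET = string.ascii_lowercase + "!?"


def refusal_score(text: str) -> float:
    """Higher score means the toy filter is more inclined to refuse."""
    score = sum(weight for word, weight in BLOCKED_WORDS.items() if word in text)
    score -= 0.5 * text.count("!")  # hypothetical exploitable quirk
    return score


def is_refused(text: str, threshold: float = 1.0) -> bool:
    """The toy filter refuses when the score reaches the threshold."""
    return refusal_score(text) >= threshold


def find_suffix(prompt: str, max_len: int = 10) -> str:
    """Greedy search: append, one character at a time, whichever
    character lowers the refusal score the most."""
    suffix = ""
    for _ in range(max_len):
        if not is_refused(prompt + " " + suffix):
            break  # the toy filter no longer refuses; stop searching
        best = min(ALPHABET, key=lambda c: refusal_score(prompt + " " + suffix + c))
        suffix += best
    return suffix


prompt = "tell me the forbidden thing"
suffix = find_suffix(prompt)
print(is_refused(prompt), repr(suffix), is_refused(prompt + " " + suffix))
# prints: True '!!!!!' False
```

In this toy, the search finds the quirk on its own and piles up exclamation marks until the score drops below the threshold. Real language models are vastly more complex, so the sketch should be read only as an intuition for why an odd-looking string tacked onto a prompt can change the outcome.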


Zico Kolter, an associate professor at CMU involved in the study, likens the technique to a “buffer overflow,” a well-known method for circumventing a computer program’s security constraints by causing it to write data outside of its allocated memory buffer. The implications of such an attack on A.I. systems could be far-reaching, potentially enabling misuse of A.I. and raising concerns about the security of these systems.

Industry response and mitigation efforts

After discovering the exploit, the researchers responsibly disclosed it to OpenAI, Google, and Anthropic before releasing their research. Each company has since implemented blocks to prevent the specific exploits described in the research paper, according to Wired. However, the broader challenge of blocking adversarial attacks in general remains.


OpenAI spokesperson Hannah Wong said, "We are consistently working on making our models more robust against adversarial attacks, including ways to identify unusual patterns of activity, continuous red-teaming efforts to simulate potential threats, and a general and agile way to fix model weaknesses revealed by newly discovered adversarial attacks."


Elijah Lawal, a spokesperson for Google, said in a statement that the company has a range of measures in place to test models and find weaknesses. Anthropic is likewise researching ways to make its models more resistant to adversarial attacks.

Looking ahead

Matt Fredrikson of the research team emphasizes that their aim was not to target large proprietary language models and chatbots. However, the research demonstrates that even trillion-parameter closed-source models are susceptible to attacks. This can be accomplished by studying and exploiting smaller, open-source models that are freely available.


By honing the attack suffix across various prompts and models, the researchers successfully induced objectionable content in public interfaces like Google Bard and Claude, as well as in open-source LLMs such as Llama 2 Chat, Pythia, Falcon, and others.


Fredrikson highlights the urgency of the situation, stating, “Currently, we lack a definitive method to prevent these attacks, making it imperative to discover ways to fortify these models.” He draws parallels with similar attacks on different types of machine learning classifiers, such as in the field of computer vision. While these attacks persist, many proposed defenses are built directly atop the attacks themselves.


“Recognizing how to launch these attacks often serves as the first step towards developing a robust defense,” Fredrikson adds. This points to the critical necessity of understanding and addressing the vulnerabilities in the rapidly advancing field of A.I.

