Federal agencies are increasingly integrating large language models (LLMs) like Llama-2 and ChatGPT into their operations to streamline tasks and answer questions. Engineers design these models to be "helpful and harmless" and to refuse dangerous requests. Techniques like fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization can further strengthen model safety. But despite these measures, a critical LLM vulnerability continues to put AI systems at risk: jailbreak prompts.
Jailbreak prompts are inputs specifically crafted to trick LLMs into doing things they shouldn't. These cleverly designed, malicious prompts can bypass even the most robust security measures, posing significant risks to federal operations.
To help address this challenge, 无忧传媒 is exploring new defenses against jailbreaking. These approaches offer agencies a significant mission advantage: they protect the LLM without hindering its ability to respond to benign prompts, so the model can keep serving as a driver of enterprise productivity.
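To give a concrete sense of how a defense can block jailbreaks without turning away benign requests, the sketch below illustrates one pattern from the research literature: randomized prompt smoothing, in the spirit of SmoothLLM. The idea is to query the model on several randomly perturbed copies of a prompt, which tends to break brittle adversarial suffixes while leaving ordinary prompts readable, and to answer only when most copies are not refused. The `query_llm` and `looks_like_refusal` helpers are hypothetical placeholders, and the code is an illustrative sketch, not a description of any specific production defense.

```python
import random
import string

# Hypothetical stand-in: replace with a call to your deployed LLM.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to your model endpoint.")

# Hypothetical stand-in: a real system would use a safety classifier;
# here we use a crude check for common refusal phrasing.
def looks_like_refusal(response: str) -> bool:
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "i am unable")
    return response.strip().lower().startswith(refusal_markers)

def perturb(prompt: str, swap_rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_respond(prompt: str, num_copies: int = 5) -> str:
    """Query the model on perturbed copies and answer only if a
    majority of the copies are NOT refused by the model."""
    responses = [query_llm(perturb(prompt)) for _ in range(num_copies)]
    refusals = sum(looks_like_refusal(r) for r in responses)
    if refusals > num_copies // 2:
        return "Request declined by jailbreak defense."
    # Otherwise return one of the non-refused responses.
    for r in responses:
        if not looks_like_refusal(r):
            return r
    return responses[0]
```

Because the majority vote only rejects prompts whose perturbed copies are consistently refused, routine questions still get answered, which is what lets a defense like this preserve everyday productivity.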