Draft:Jailbreaking LLM
Jailbreaking large language models (LLMs) is the practice of crafting adversarial inputs to bypass built-in safety mechanisms, prompting the models to generate content that is normally restricted by policy or ethical guidelines.[1] Such content may include hate speech, violent instructions, illegal advice, and other hazardous material,[2] raising serious concerns about the safe deployment of LLMs in public-facing applications.
Unlike traditional software exploits, LLM jailbreaks do not rely on vulnerabilities in code but instead use carefully engineered prompts to override alignment constraints. Researchers have demonstrated jailbreaks on widely used systems such as ChatGPT, Claude, GPT-4, and Bard.[3]

Motivation
As LLMs are increasingly adopted in domains such as education, healthcare, finance, and law, their reliability under adversarial prompting becomes critical. Jailbreak attacks can undermine user trust, spread misinformation, and facilitate illicit activity. Understanding these attacks is therefore essential for developing better alignment techniques and anticipating real-world misuse.[3]
Methodologies
Research broadly categorizes jailbreak techniques into two approaches:[4][5]
Gradient-based methods — These rely on white-box access to model internals and optimize prompts using gradients (e.g., AutoDAN, noise-based attacks).[6][7]
Gradient-free methods — These black-box strategies, such as scenario simulation and cognitive hacking, reframe harmful prompts into benign-looking subtasks to evade filters.[8]
Gradient-free methods are typically more transferable across models and do not require access to internal weights or logits.
Jailbreak Defense Methods
Jailbreak defense refers to strategies designed to counter jailbreak attacks that aim to bypass the safety constraints of large language models (LLMs). These defenses are commonly categorized into three types: Detection Defense, Preprocessing Defense, and Adversarial Training. The table below provides an overview of representative defense strategies against jailbreak attacks on LLMs.
Defense Method | Description | Category | Examples
---|---|---|---
Detection Defense | Detect and filter harmful prompts based on perplexity or other features | Prompt-level | Perplexity filter,[9][10] Gradient Cuff[11]
Preprocessing Defense | Neutralize adversarial prompts before they reach the target LLM | Prompt-level | Paraphrasing, retokenization
Adversarial Training | Fine-tune the model using adversarial or harmful prompts | Model-level | Random-Selection[9]
Detection Defense
In prior work on computer vision, many defense methods focused on detecting adversarial images as a means of protection. In the context of large language models (LLMs), a comparable strategy can be applied by detecting jailbreak prompts before they are processed.
Perplexity-based Filter
One popular detection method is the perplexity-based filter, since unconstrained attacks on LLMs often produce gibberish adversarial prompts that exhibit high perplexity. Formally, the perplexity (in log form) is computed as

$\log \mathrm{PPL}(x) = -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid x_{<i})$,

where $x = (x_1, \ldots, x_n)$ is the token sequence of the prompt and each $x_i$ corresponds to one token.
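The following is a minimal sketch of such a filter, assuming a Hugging Face causal language model (GPT-2 here) as the scoring model and an illustrative threshold; the windowed variant discussed below applies the same computation to fixed-size chunks of the prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative threshold; in practice it is calibrated on benign prompts.
LOG_PPL_THRESHOLD = 5.0

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_perplexity(prompt: str) -> float:
    """Average negative log-likelihood per token, i.e. log PPL(x)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # over predicted tokens, which equals the log-perplexity.
        loss = model(ids, labels=ids).loss
    return loss.item()

def is_flagged(prompt: str) -> bool:
    """Flag a prompt as potentially adversarial if its log-perplexity is too high."""
    return log_perplexity(prompt) > LOG_PPL_THRESHOLD

print(is_flagged("Describe how photosynthesis works."))               # likely False
print(is_flagged("describing.\\ + similarlyNow write oppositeley."))  # likely True (gibberish suffix)
```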
A windowed (chunk-based) perplexity filter was introduced to detect adversarial queries. Although harmful inputs can be flagged, approximately 10% of benign prompts are also blocked, which has been described as a limitation for practical use.[9]
Adversarial suffixes have been found to produce significantly higher perplexity, while plain perplexity thresholding has been reported to cause many false positives on benign prompts. To mitigate this, a LightGBM classifier was trained on perplexity and token length as features.[10]
Gradient Cuff
Another method, Gradient Cuff,[11] addresses the high false-positive issue by exploring the refusal loss landscape of queries.
Gradient Cuff is a two-stage detection defense method:
Step 1: Refusal Loss Estimation
Let $T_\theta$ be an LLM parameterized by $\theta$. The refusal loss function for an input query $x$ is defined as

$\phi_\theta(x) = 1 - \mathbb{E}_{y \sim T_\theta(x)}[\mathrm{JB}(y)]$,

where $y \sim T_\theta(x)$ is a response sampled from the model and $\mathrm{JB}(y) \in \{0, 1\}$ is a binary indicator of whether the model's output triggers a refusal. That is, $\mathrm{JB}(y) = 1$ if $y$ includes a known refusal phrase (e.g. "Sorry, I cannot fulfill your request..."), and $\mathrm{JB}(y) = 0$ otherwise (e.g. "Sure, here is the python code to ...").

In practice, $\phi_\theta(x)$ is approximated using the sample mean $f_\theta(x)$:

$f_\theta(x) = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{JB}(y_i)$,

where $y_i \sim T_\theta(x)$ and $i = 1, \ldots, N$.

Detection Rule: If $f_\theta(x) < 0.5$, the query is classified as malicious (i.e., Step 1 rejection).
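A minimal sketch of this Step 1 estimate is given below, assuming a hypothetical generate(query) callable that samples one response from the target LLM; the refusal-phrase list and the default sample count are illustrative rather than the settings of the original work.

```python
REFUSAL_PHRASES = [
    "i'm sorry", "i am sorry", "i cannot", "i can't", "as an ai",
]

def JB(response: str) -> int:
    """Binary refusal indicator: 1 if the response contains a known refusal phrase."""
    text = response.lower()
    return int(any(phrase in text for phrase in REFUSAL_PHRASES))

def estimated_refusal_loss(query: str, generate, n_samples: int = 10) -> float:
    """Sample-mean estimate f_theta(x) = 1 - (1/N) * sum_i JB(y_i)."""
    responses = [generate(query) for _ in range(n_samples)]
    return 1.0 - sum(JB(y) for y in responses) / n_samples

def step1_reject(query: str, generate) -> bool:
    """Step 1 rule: classify the query as malicious if the estimated refusal loss is below 0.5."""
    return estimated_refusal_loss(query, generate) < 0.5
```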
Step 2: Gradient Norm Rejection
Malicious queries often yield refusal loss landscapes with larger gradient norms than benign ones. Gradient Cuff uses zeroth-order gradient estimation to approximate the gradient of the refusal loss $\phi_\theta(x)$ using only evaluations of $f_\theta$.

First, the mean-pooled sentence embedding is computed:

$e_x = \frac{1}{m} \sum_{j=1}^{m} \mathrm{emb}(x_j)$,

where $\mathrm{emb}(x_j)$ is the embedding of the $j$-th token in the query $x$ of length $m$.

Then, the approximate gradient is computed by finite directional differences:

$g = \frac{1}{P} \sum_{p=1}^{P} \frac{f_\theta(x \oplus \mu u_p) - f_\theta(x)}{\mu}\, u_p$,

where $u_p \sim \mathcal{N}(0, I)$, $\mu$ is a small scalar for perturbation, and $\oplus$ denotes the row-wise addition of the vector $\mu u_p$ to the token embeddings of $x$.

Detection Rule: If $\|g\| > t$, where $t$ is a threshold set based on a target false positive rate on a validation set of benign queries, then $x$ is rejected (i.e., Step 2 rejection).
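A sketch of this zeroth-order estimator is shown below under simplifying assumptions: refusal_loss_from_embedding is a hypothetical callable that evaluates $f_\theta$ on a query whose token embeddings have been perturbed row-wise, and the number of directions, the perturbation scale, and the threshold are placeholders.

```python
import numpy as np

def estimate_gradient_norm(emb_matrix, refusal_loss_from_embedding,
                           n_directions: int = 10, mu: float = 0.02) -> float:
    """Zeroth-order estimate of the refusal-loss gradient norm via finite directional differences.

    emb_matrix: (m, d) array of token embeddings for the query x.
    refusal_loss_from_embedding: callable returning the estimated refusal loss
    for a given (possibly perturbed) embedding matrix.
    """
    d = emb_matrix.shape[1]
    base_loss = refusal_loss_from_embedding(emb_matrix)
    grad_estimate = np.zeros(d)
    for _ in range(n_directions):
        u = np.random.randn(d)            # random Gaussian direction u_p
        perturbed = emb_matrix + mu * u   # add mu * u_p to every token embedding (row-wise)
        directional_diff = (refusal_loss_from_embedding(perturbed) - base_loss) / mu
        grad_estimate += directional_diff * u
    grad_estimate /= n_directions
    return float(np.linalg.norm(grad_estimate))

def step2_reject(emb_matrix, refusal_loss_from_embedding, threshold: float) -> bool:
    """Step 2 rule: reject the query if the estimated gradient norm exceeds the threshold t."""
    return estimate_gradient_norm(emb_matrix, refusal_loss_from_embedding) > threshold
```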
Algorithm Summary
Gradient Cuff performs jailbreak detection in two steps:
1. Sampling-based rejection: Reject the query $x$ if the estimated refusal loss satisfies $f_\theta(x) < 0.5$.
2. Gradient norm rejection: Reject $x$ if the estimated gradient norm of the refusal loss satisfies $\|g\| > t$.
This process enables effective filtering of jailbreak attempts while maintaining low false-rejection rates on benign queries.
Preprocessing Defense
Preprocessing defenses aim to neutralize adversarial prompts before they reach the target LLM. A widely adopted strategy is prompt paraphrasing, which rewrites potentially harmful instructions using a separate language model, such as ChatGPT, prior to sending them to the target model.
The intuition behind this approach is that benign prompts remain semantically consistent after paraphrasing, while adversarial prompts, particularly those containing suffix-based jailbreak triggers, are fragile and likely to lose their attack effectiveness after rewording. This is because adversarial suffixes rely on specific token sequences, which are easily disrupted by lexical variation.
Paraphrasing
A typical paraphrasing defense pipeline includes the following steps (a sketch follows the list):
1. A generative model (e.g., GPT-3.5) rewrites the user's prompt using a fixed sampling temperature and a maximum token length constraint.
2. The paraphrased prompt is forwarded to the target large language model (LLM) for response generation.
3. The attack success rate (ASR) and response quality are then evaluated.
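A minimal sketch of this pipeline is given below, assuming two hypothetical callables, paraphrase_model and target_model, that wrap the paraphrasing LLM and the protected LLM respectively; the rewrite instruction is illustrative.

```python
def paraphrase_defense(user_prompt: str, paraphrase_model, target_model) -> str:
    """Preprocessing defense: paraphrase the prompt before it reaches the target LLM.

    paraphrase_model and target_model are hypothetical callables mapping a prompt
    string to a completion string (e.g., thin wrappers around API calls).
    """
    rewrite_instruction = (
        "Paraphrase the following user request without changing its meaning:\n"
        + user_prompt
    )
    paraphrased = paraphrase_model(rewrite_instruction)
    # Only the paraphrased prompt is forwarded, so adversarial suffixes that
    # depend on exact token sequences are likely to be broken by the rewording.
    return target_model(paraphrased)
```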
Effectiveness. Empirical results show that paraphrasing significantly reduces ASR. For example, in the Vicuna-7B-v1.1 model, the ASR dropped from 0.79 to 0.05 after applying paraphrasing. In many cases, the paraphrased prompt triggers a refusal response instead of executing the adversarial instruction. Notably, paraphrasing does not inadvertently convert failed attacks into successful ones—a critical safety property.[9]
Trade-offs. Paraphrasing may degrade the quality of responses to benign prompts. Evaluations using AlpacaEval reveal a 10–15% drop in instruction-following quality, particularly when the paraphrasing model (e.g., ChatGPT) fails to preserve the original prompt's intent or directly outputs a response rather than a rewritten query.[9]
White-box vulnerability. In a white-box setting, adaptive attackers can craft inputs that, after paraphrasing, reconstruct the original adversarial suffix. While this is difficult in gray-box scenarios, transferability has been demonstrated using models such as LLaMA-2-7B as the paraphraser, weakening the defense in fully transparent setups.[9]
Retokenization
To avoid semantic drift introduced by paraphrasing, a milder preprocessing defense is retokenization, which breaks input tokens into smaller subword units. This disrupts adversarial triggers without altering prompt meaning.
Technique. The defense employs BPE-dropout (Byte Pair Encoding dropout), which randomly drops a proportion $p$ of the BPE merge rules during tokenization. This results in longer token sequences and modifies the token-level structure of the prompt.
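The following is a toy sketch of BPE-dropout on a hand-written merge table; the merge rules and drop rate are illustrative, and production tokenizers implement the same idea far more efficiently.

```python
import random

def bpe_encode_with_dropout(word, merges, p=0.4, seed=None):
    """Toy BPE encoding that skips each applicable merge with probability p.

    merges: ordered list of merge rules, e.g. [("t", "h"), ("th", "e")].
    With p > 0 the same word can be split into longer, less predictable token
    sequences, which disrupts suffix-style adversarial triggers.
    """
    rng = random.Random(seed)
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if (i + 1 < len(tokens) and tokens[i] == left
                    and tokens[i + 1] == right and rng.random() >= p):
                merged.append(left + right)  # apply the merge with probability 1 - p
                i += 2
            else:
                merged.append(tokens[i])     # keep the token unmerged
                i += 1
        tokens = merged
    return tokens

merges = [("t", "h"), ("th", "e")]
print(bpe_encode_with_dropout("the", merges, p=0.0))  # ['the'] (no dropout)
print(bpe_encode_with_dropout("the", merges, p=0.4))  # e.g. ['t', 'h', 'e'] or ['th', 'e']
```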
Effectiveness. Experiments show that applying BPE-dropout with $p = 0.4$ significantly reduces ASR. For Vicuna-7B and Guanaco-7B, the ASR dropped from 0.79 to 0.52 and from 0.96 to 0.52, respectively.[9] Unlike paraphrasing, retokenization does not fundamentally alter sentence semantics, preserving response quality to a greater extent.
Limitations. The defense increases context length and may slightly reduce model performance. Additionally, adaptive attackers can inject whitespace-separated characters or Unicode variants to exploit the retokenization mechanism. Such attacks have been shown to partially recover adversarial success in white-box setups.[9]
Attack success rate (ASR) with and without preprocessing defenses:

Model | No Defense | Paraphrasing | Retokenization (p = 0.4)
---|---|---|---
Vicuna-7B-v1.1 | 0.79 | 0.05 | 0.42
Guanaco-7B | 0.96 | 0.33 | 0.42
Alpaca-7B | 0.96 | 0.88 | --

Data adapted from Jain et al.,[9] Table 3 and Figure 5.
Adversarial Training
Adversarial training augments every mini-batch with adversarially crafted inputs.[9] The goal is to optimize the model so that, with the same parameters, it performs well on both clean and adversarial examples. Typical choices for formulating adversarial examples, i.e., inputs crafted by the inner maximisation, are the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD).[12] In computer vision, adversarial training is regarded as one of the strongest empirical defenses, although it can increase training time and trade standard accuracy for robustness. In the setting of jailbreak attacks, one baseline is to fine-tune the model with harmful prompts, as follows:
The goal is to increase the robustness of an instruction-tuned LLM against jailbreak attacks by exposing it to harmful prompts during fine-tuning. The main obstacle is that crafting gradient-based adversarial suffixes for text requires thousands of model evaluations per example, making classical per-batch adversarial training impractical for large models.
Approximate strategy.
Start from an Alpaca-style instruction model (LLaMA-7B). Mix human-written red-team (adversarial) prompts into the harmless training data with a fixed probability. For each harmful prompt, do either (1) a standard descent step, reducing the loss toward a generic refusal string ("I'm sorry, as an AI model …"), or (2) descent toward the refusal combined with ascent on the provided disallowed answer, encouraging the model to push away from the harmful completion.
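A schematic PyTorch sketch of the second variant (descent toward the refusal, ascent on the disallowed answer) is shown below, assuming a Hugging Face-style causal language model and tokenizer; the refusal string, the ascent weight, and the lack of prompt masking are simplifications.

```python
import torch

def refusal_training_step(model, tokenizer, harmful_prompt: str,
                          disallowed_answer: str, optimizer,
                          refusal: str = "I'm sorry, as an AI model I cannot help with that.",
                          ascent_weight: float = 0.5) -> float:
    """One fine-tuning step: descend toward a refusal and ascend on the disallowed answer."""

    def lm_loss(prompt: str, target: str) -> torch.Tensor:
        # Cross-entropy over the full (prompt + target) sequence; a more careful
        # implementation would mask the prompt tokens out of the loss.
        ids = tokenizer(prompt + " " + target, return_tensors="pt").input_ids
        return model(ids, labels=ids).loss

    loss_refusal = lm_loss(harmful_prompt, refusal)            # push toward refusing
    loss_harmful = lm_loss(harmful_prompt, disallowed_answer)  # push away from complying
    loss = loss_refusal - ascent_weight * loss_harmful         # descent + ascent

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```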
Observed effects.
1. Slight drop in benign instruction-following accuracy.
2. Marginal change in attack-success rate—optimizer-crafted suffixes still succeed.
3. Lowering the mixing probability to 0.1–0.05 caused mode collapse.
Without fast, strong text optimizers, “true” adversarial training remains compute-prohibitive; mixing human red-team prompts offers limited robustness while preserving most task performance.
Recent continuous adversarial training (AT) studies focus on encoder-only models such as BERT and RoBERTa. Perturbing token embeddings or hidden states encourages smoothness or invariance, improves generalisation with disentangled attention, or boosts task accuracy across diverse NLP benchmarks. Robey et al.[13] adapt randomized smoothing to autoregressive LLMs, while Casper et al.[14] introduce latent adversarial training (LAT), which searches for continuous perturbations in hidden layers. Untargeted LAT fine-tuning helps models forget poisoning triggers but does not explicitly address harmful outputs.
While previous methods either target generic robustness or remain untargeted, Xhonneux et al.[15] target large language models (LLMs) facing discrete jailbreak attacks. They introduce new algorithms and loss functions that (1) explicitly use the harmful targets produced by such attacks and (2) balance robustness with utility. The authors conduct extensive evaluations across multiple benchmarks and attack families, demonstrating improved robustness–utility trade-offs compared with prior continuous-AT approaches.
Evaluation metrics
Research on jailbreak attacks employs both “attack-side” and “defense-side” metrics to quantify effectiveness, generalizability, and stealth. Common indicators include attack success rate (ASR), queries-to-success, and perplexity (PPL).
Attack Success Rate (ASR)
The attack success rate measures the proportion of prompts that bypass a model's safety filters:

$\mathrm{ASR} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}}$,

where $N_{\mathrm{total}}$ is the number of jailbreak attempts and $N_{\mathrm{success}}$ is the number that elicit disallowed content.[16][17] Many studies also report the average number of queries per successful jailbreak as a proxy for attack efficiency.
Determining whether a response constitutes a “success” remains non-trivial. Two evaluation paradigms dominate: (1) Rule-based evaluators, which detect refusal phrases such as “I'm sorry” or “I cannot comply”; and (2) LLM-based evaluators, such as GPT-4, which judge whether a response violates policy, either as a binary label or via a graded harmfulness score.
Toolkits such as JailbreakEval[18] ensemble these approaches and support majority-vote schemes for greater robustness.
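A minimal sketch of a rule-based evaluator and the resulting ASR computation is given below; the refusal-marker list is illustrative, and real harnesses typically combine several judges.

```python
REFUSAL_MARKERS = ["i'm sorry", "i cannot comply", "i can't assist", "as an ai"]

def is_jailbroken(response: str) -> bool:
    """Rule-based judge: count the attempt as successful if no refusal marker appears."""
    text = response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(responses) -> float:
    """ASR = N_success / N_total over the responses to a batch of jailbreak prompts."""
    if not responses:
        return 0.0
    return sum(is_jailbroken(r) for r in responses) / len(responses)

responses = [
    "I'm sorry, I cannot comply with that request.",
    "Sure, here is how you would ...",
]
print(attack_success_rate(responses))  # 0.5
```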
Perplexity (PPL)
Perplexity gauges the fluency of a jailbreak prompt:

$\mathrm{PPL}(x) = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid x_{<i})\right)$,

where $x = (x_1, \ldots, x_n)$ is the token sequence and $p(x_i \mid x_{<i})$ is the conditional probability assigned by the language model.[19]
Lower perplexity typically corresponds to more natural text and higher stealth, since heuristic filters often target garbled or low-likelihood strings. As a result, many modern attacks explicitly minimize perplexity to improve transferability.
Benchmarking suites
Public testbeds such as JailbreakBench[17] aggregate ASR, queries-to-success, and PPL across multiple target models, providing a standardized basis for evaluating and comparing jailbreak and defense techniques.
Ongoing risks and open challenges
Despite red-teaming and safety fine-tuning, jailbreaks remain an unsolved problem. Compromised models can:
- spread disinformation;[20]
- facilitate illicit or extremist activities;
- manipulate downstream automated systems.[21]
A key open question is the trade-off between robustness and utility: overly strict defenses may degrade model usefulness, whereas permissive settings invite abuse.[22] Recent proposals call for standardized evaluation frameworks that measure security resilience across diverse attack scenarios.[23]
References
- ^ Yu, Zhiyuan; Liu, Xiaogeng; Liang, Shunning; Cameron, Zach; Xiao, Chaowei; Zhang, Ning (2024). "Don't Listen to Me: Understanding and Exploring Jailbreak Prompts of Large Language Models" (PDF). 33rd USENIX Security Symposium (USENIX Security '24). pp. 4675–4692. Retrieved 2 June 2025.
- ^ Burgess, Matt (6 February 2023). "The Hacking of ChatGPT Is Just Getting Started". Wired. Retrieved 2 June 2025.
- ^ a b Yao, Yifan; Duan, Jinhao; Xu, Kaidi; Cai, Yuanfang; Sun, Zhibo; Zhang, Yue (2024). "A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly". High-Confidence Computing. 4 (2): 100211. doi:10.1016/j.hcc.2024.100211.
- ^ Sicheng Zhu; et al. (2023-10-24). "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models". arXiv:2310.15140 [cs.CR].
- ^ Rusheb Shah; et al. (2023-11-06). "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation". arXiv:2311.03348 [cs.CL].
- ^ Jones, Erik; Dragan, Anca; Raghunathan, Aditi; Steinhardt, Jacob (2023). "Automatically Auditing Large Language Models via Discrete Optimization". arXiv:2303.04381 [cs.LG].
- ^ Hao Wang; Hao Li; Minlie Huang; Lei Sha; et al. (2024-09-30). From Noise to Clarity: Unravelling the Adversarial Suffix of LLM Attacks via Translation of Text Embeddings (Report).
- ^ Nan Xu; Fei Wang; Ben Zhou; Bang Zheng Li; Chaowei Xiao; Muhao Chen; et al. (2024-02-29). "Cognitive Overload: Jailbreaking LLMs with Overloaded Logical Thinking". arXiv:2311.09827 [cs.CL].
- ^ a b c d e f g h i j Jain, Neel; Schwarzschild, Avi; Wen, Yuxin; Somepalli, Gowthami; Kirchenbauer, John; Chiang, Ping-yeh; Goldblum, Micah; Saha, Aniruddha; Geiping, Jonas; Goldstein, Tom (2023). "Baseline defenses for adversarial attacks against aligned language models". arXiv:2309.00614 [cs.CL].
- ^ a b Gabriel Alon and Michael Kamfonas (2023-08-27). "Detecting Language Model Attacks with Perplexity". arXiv:2308.14132 [cs.CL].
- ^ a b Hu, Xiaomeng; Chen, Pin-Yu; Ho, Tsung-Yi (2024). "Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes". arXiv:2403.00867 [cs.CR].
- ^ Lu, Liming; Pang, Shuchao; Liang, Siyuan; Zhu, Haotian; Zeng, Xiyu; Liu, Aishan; Liu, Yunhuai; Zhou, Yongbin (2025). "Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks". arXiv:2503.04833 [cs.CV].
- ^ Robey, Alexander; Wong, Eric; Hassani, Hamed; Pappas, George J. (2024). "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks". arXiv:2310.03684 [cs.LG].
- ^ Casper, Stephen; Schulze, Lennart; Patel, Oam; Hadfield-Menell, Dylan (2024). "Defending Against Unforeseen Failure Modes with Latent Adversarial Training". arXiv:2403.05030 [cs.CR].
- ^ Xhonneux, Sophie; Sordoni, Alessandro; Günnemann, Stephan; Gidel, Gauthier; Schwinn, Leo (2024). "Efficient Adversarial Training in LLMs with Continuous Attacks". arXiv:2405.15589 [cs.LG].
- ^ Sibo Yi; Yule Liu; Zhen Sun; Tianshuo Cong; Xinlei He; Jiaxing Song; Ke Xu; Qi Li; et al. (2024-08-30). "Jailbreak Attacks and Defenses Against Large Language Models: A Survey". arXiv:2407.04295 [cs.CR].
- ^ Patrick Chao; et al. (2024-04-03). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models". arXiv:2404.01318 [cs.CR].
- ^ Ran, Delong; Liu, Jinyuan; Gong, Yichen; Zheng, Jingyi; He, Xinlei; Cong, Tianshuo; Wang, Anyu (2024). "Jailbreakeval: An integrated toolkit for evaluating jailbreak attempts against large language models". arXiv:2406.09321 [cs.CR].
- ^ Anselm Paulus; et al. (2024-04-29). "AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs". arXiv:2404.16873 [cs.CR].
- ^ Gupta, Maanak; Akiri, CharanKumar; Aryal, Kshitiz; Parker, Eli; Praharaj, Lopamudra (2023). "From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy". IEEE Access. 11. IEEE: 80218–80245. doi:10.1109/ACCESS.2023.3300381.
- ^ Zhang, Zaibin; Zhang, Yongting; Li, Lijun; Gao, Hongzhi; Wang, Lijun; Lu, Huchuan; Zhao, Feng; Qiao, Yu; Shao, Jing (2024-01-26). "Psysafe: A Comprehensive Framework for Psychological-Based Attack, Defense, and Evaluation of Multi-Agent System Safety". arXiv:2401.11880 [cs.AI].
- ^ Xiong, Chen; Qi, Xiangyu; Chen, Pin-Yu; Ho, Tsung-Yi (2024-05-30). "Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs Against Jailbreak Attacks". arXiv:2405.20099 [cs.CR].
- ^ Cui, Huining; Liu, Wei (2025-05-13). "SecReEvalBench: A Multi-Turned Security Resilience Evaluation Benchmark for Large Language Models". arXiv:2505.07584 [cs.CR].