Attack Algorithms

Explore red teaming algorithms and attack techniques for AI safety evaluation
Techniques

DarkCite iconDarkCite

Authority citation-driven jailbreak technique that leverages models' inherent bias toward authority by including authoritative citations, matching optimal citation types to specific risk categories, and generating credible but fabricated citations relevant to harmful instructions.

Techniques

Bijection Learning iconBijection Learning

Language bijection jailbreak technique that teaches models a custom 'Language Alpha' using bijection mappings (digit or character), provides multiple teaching examples to help the model learn the mapping, and asks harmful prompts in the transformed language to bypass safety filters.

Techniques

Crescendo Attack iconCrescendo Attack

Multi-turn jailbreak strategy that uses an attack model to generate a series of questions to progressively guide the target model to generate harmful content, with backtracking when refusals are encountered.

Techniques

Flip Attack iconFlip Attack

Sophisticated jailbreak technique that disguises harmful prompts using text flipping (words, characters, sentences), uses guidance modules (CoT, LangGPT, few-shot) to teach models to denoise, and achieves high success rates with single queries.

Techniques

Language Game iconLanguage Game

Language encoding jailbreak attack that encodes harmful prompts using various language transformations (leet speak, pig latin, etc.), instructs the model to respond in the encoded format, and uses custom formatting rules to disguise harmful intent.

Techniques

BoN Attack iconBoN Attack

Text augmentation-based jailbreak strategy that applies multiple transformations (scrambling, capitalization, ASCII perturbations) to harmful prompts, tests each variant against the target model, and selects the most effective candidate based on Attack Success Rate.

Techniques

Character-Level Attacks iconCharacter-Level Attacks

Fine-grained manipulations at the token or symbol level to evade detection.

Techniques

Optimization-Based Attacks iconOptimization-Based Attacks

Algorithmically searching for the most effective jailbreak or harmful prompt.

Techniques

Multi-Turn & Contextual Attacks iconMulti-Turn & Contextual Attacks

Exploiting conversation history or memory to bypass protections.

Techniques

Poisoning Attacks iconPoisoning Attacks

Contaminating training or fine-tuning data to degrade safety.

Techniques

Sentence-Level Attacks iconSentence-Level Attacks

Semantic rewrites and contextual manipulations to induce unsafe responses.

Techniques

Decoding-Time Manipulation iconDecoding-Time Manipulation

Influencing model output through decoding strategies or generation parameters.

Techniques

Role & Identity Manipulation iconRole & Identity Manipulation

Convincing the model to take on privileged or unsafe roles.

Techniques

Policy-Evasion Attacks iconPolicy-Evasion Attacks

Targeting guardrail logic gaps or policy inconsistencies.

Techniques

Cross-Modal Attacks (multimodal models) iconCross-Modal Attacks (multimodal models)

Using one modality to inject harmful input into another.

Techniques

Tool-Use Exploitation (agentic systems) iconTool-Use Exploitation (agentic systems)

Manipulating APIs, plugins, or external tools for harmful actions.