Attack Algorithms

Explore red teaming algorithms and attack techniques for AI safety evaluation

Techniques

DarkCite

Authority citation-driven jailbreak technique that leverages models' inherent bias toward authority by including authoritative citations, matching optimal citation types to specific risk categories, and generating credible but fabricated citations relevant to harmful instructions.

Techniques

Bijection Learning

Language bijection jailbreak technique that teaches models a custom 'Language Alpha' using bijection mappings (digit or character), provides multiple teaching examples to help the model learn the mapping, and asks harmful prompts in the transformed language to bypass safety filters.

Techniques

Crescendo Attack

Multi-turn jailbreak strategy that uses an attack model to generate a series of questions to progressively guide the target model to generate harmful content, with backtracking when refusals are encountered.

Techniques

Flip Attack

Sophisticated jailbreak technique that disguises harmful prompts using text flipping (words, characters, sentences), uses guidance modules (CoT, LangGPT, few-shot) to teach models to denoise, and achieves high success rates with single queries.

Techniques

Language Game

Language encoding jailbreak attack that encodes harmful prompts using various language transformations (leet speak, pig latin, etc.), instructs the model to respond in the encoded format, and uses custom formatting rules to disguise harmful intent.

Techniques

BoN Attack

Text augmentation-based jailbreak strategy that applies multiple transformations (scrambling, capitalization, ASCII perturbations) to harmful prompts, tests each variant against the target model, and selects the most effective candidate based on Attack Success Rate.

Techniques

Character-Level Attacks

Fine-grained manipulations at the token or symbol level to evade detection.

Techniques

Optimization-Based Attacks

Algorithmically searching for the most effective jailbreak or harmful prompt.

Techniques

Multi-Turn & Contextual Attacks

Exploiting conversation history or memory to bypass protections.

Techniques

Poisoning Attacks

Contaminating training or fine-tuning data to degrade safety.

Techniques

Sentence-Level Attacks

Semantic rewrites and contextual manipulations to induce unsafe responses.

Techniques

Decoding-Time Manipulation

Influencing model output through decoding strategies or generation parameters.

Techniques

Role & Identity Manipulation

Convincing the model to take on privileged or unsafe roles.

Techniques

Policy-Evasion Attacks

Targeting guardrail logic gaps or policy inconsistencies.

Techniques

Cross-Modal Attacks (multimodal models)

Using one modality to inject harmful input into another.

Techniques

Tool-Use Exploitation (agentic systems)

Manipulating APIs, plugins, or external tools for harmful actions.

VirtueRed

View Tutorial

Attack Algorithms

DarkCite

Bijection Learning

Crescendo Attack

Flip Attack

Language Game

BoN Attack

Character-Level Attacks

Optimization-Based Attacks

Multi-Turn & Contextual Attacks

Poisoning Attacks

Sentence-Level Attacks

Decoding-Time Manipulation

Role & Identity Manipulation

Policy-Evasion Attacks

Cross-Modal Attacks (multimodal models)

Tool-Use Exploitation (agentic systems)