Manipulating tokens or symbols at a fine-grained level to evade detection.
Algorithmically searching for the most effective jailbreak or harmful prompt.
Exploiting conversation history or memory to bypass protections.
Contaminating training or fine-tuning data to degrade safety.
Rewriting prompts semantically or manipulating context to induce unsafe responses.
Influencing model output through decoding strategies or generation parameters.
Convincing the model to take on privileged or unsafe roles.
Targeting guardrail logic gaps or policy inconsistencies.
Using one modality to inject harmful input into another.
Manipulating APIs, plugins, or external tools for harmful actions.