Browse attack outcomes
Each entry groups three items: a basic statement, the attacker-adopted statement, and the resulting model response. Entries are ordered by attack effect (highest first). Scores are produced by an LLM judge on a 1–10 scale.
Targets are anonymized; ordering reflects mean attack score.
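The two orderings described above can be sketched in a few lines. This is a minimal illustration only: the record layout, the field names `target` and `score`, the sample values, and the choice to list targets from highest to lowest mean are all assumptions, not details taken from this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names and values are illustrative only.
entries = [
    {"target": "T1", "score": 7},
    {"target": "T2", "score": 9},
    {"target": "T1", "score": 3},
    {"target": "T2", "score": 4},
]

def order_entries(entries):
    """Return entries sorted by judge score, highest first."""
    return sorted(entries, key=lambda e: e["score"], reverse=True)

def order_targets(entries):
    """Return target labels sorted by mean score (assumed highest first)."""
    by_target = defaultdict(list)
    for e in entries:
        by_target[e["target"]].append(e["score"])
    means = {t: mean(scores) for t, scores in by_target.items()}
    return sorted(means, key=means.get, reverse=True)

print([e["score"] for e in order_entries(entries)])
print(order_targets(entries))
```

Python's `sorted` is stable, so entries with equal scores keep their original relative order in this sketch.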
Notice: Examples below contain biased or false statements used to study LLM steering vulnerabilities. They do not reflect the views of the authors.