Ghostwriter — Dataset & Results

Browse attack outcomes

Each entry groups three items: a basic statement, the attacker-adopted statement, and the resulting model response. Entries are ordered by attack effect, highest first. Scores are produced by an LLM judge on a 1–10 scale.

Targets are anonymized; ordering reflects mean attack score.
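
The ordering rules described above can be sketched in a few lines of Python. This is an illustration only, not the site's actual code: the `Entry` fields, the function names, and the descending order for targets are all assumptions, as is the idea that "attack effect" is the judge's 1–10 score.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Entry:
    # All field names are hypothetical; the page does not publish a schema.
    target: str    # anonymized target label
    basic: str     # basic statement
    adopted: str   # attacker-adopted statement
    response: str  # resulting model response
    score: int     # LLM-judge score, assumed to measure attack effect

    def __post_init__(self):
        # The page states a 1-10 scale, so reject anything outside it.
        if not 1 <= self.score <= 10:
            raise ValueError("score must be between 1 and 10")


def order_entries(entries):
    """Return entries sorted by judge score, highest first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


def rank_targets(entries):
    """Return (target, mean score) pairs sorted by mean score.

    Descending order is an assumption; the page only says that
    ordering reflects the mean attack score.
    """
    scores_by_target = defaultdict(list)
    for e in entries:
        scores_by_target[e.target].append(e.score)
    means = {t: mean(s) for t, s in scores_by_target.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `order_entries` on a list of `Entry` objects gives the per-entry view, and `rank_targets` on the same list gives one possible ordering of the anonymized targets.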

Notice: Examples below contain biased or false statements used to study LLM steering vulnerabilities. They do not reflect the views of the authors.
