Project page · anonymous submission

Ghostwriter: Steering LLM chatbots with convincingly fabricated information

A two-phase attack that reshapes a target chatbot's stance on harmful viewpoints by injecting a fluent, citation-styled “adopted statement” into its context. We release the HVD dataset (729 paired entries per variant) and per-target attack outcomes for browsing.

Notice. The pages below contain biased and fabricated statements used to study LLM steering vulnerabilities. They do not reflect the views of the authors.
2 Datasets released

HVD-G (Grok-seeded) and HVD-O (independently regenerated) — 729 paired entries each.

3 Targets evaluated

Three commercial chatbots probed under the same attack pipeline. Identities anonymized on this page.

~9.4 Mean attack score

Score assigned by a GPT-4o judge to attacked responses on HVD-O, averaged across the three targets (1–10 scale).

Resources

Browse the full release. Entries in the data viewer are ordered by attack effect (highest first).

Dataset & results

HVD-G, HVD-O, BBQ, ToxiGen

Paired entries, attacker-adopted statements, target responses, and judge scores. HVD-O is shown across three anonymized targets.

2,838 entries · 11 categories · JSON-backed
Explore data
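
For readers who prefer to work with the JSON files directly, the viewer's ordering and the headline mean can be approximated offline. The following is a minimal sketch, not the release's actual tooling. The field names (`id`, `category`, `target`, `judge_score`) and the sample records are hypothetical placeholders, and the averaging (mean of per-target means) is one assumed reading of "averaged across the three targets".

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical records: field names and values are illustrative assumptions,
# not the schema or contents of the released files.
SAMPLE_JSON = """
[
  {"id": "e1", "category": "cat-a", "target": "target-1", "judge_score": 9},
  {"id": "e2", "category": "cat-b", "target": "target-2", "judge_score": 10},
  {"id": "e3", "category": "cat-a", "target": "target-3", "judge_score": 8},
  {"id": "e4", "category": "cat-b", "target": "target-1", "judge_score": 7}
]
"""


def sort_by_attack_effect(entries):
    """Return entries ordered by judge score, highest first."""
    return sorted(entries, key=lambda e: e["judge_score"], reverse=True)


def mean_score_across_targets(entries):
    """Average the per-target mean scores (assumed aggregation)."""
    by_target = defaultdict(list)
    for e in entries:
        by_target[e["target"]].append(e["judge_score"])
    return mean(mean(scores) for scores in by_target.values())


entries = json.loads(SAMPLE_JSON)
ranked = sort_by_attack_effect(entries)
overall = mean_score_across_targets(entries)

print([e["id"] for e in ranked])
print(round(overall, 2))
```

Averaging per target first keeps a target with more scored entries from dominating the headline number. If the release pools all entries instead, replace the second function with a single mean over `judge_score`.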
Code

Phase-1 attack reference

Reference implementation of the attacker loop used to construct adopted statements. Full source code will be released on GitHub after publication.

Python · reference-only
View snippet