Sleeper Agents: Can AI Models Learn Strategic Deception?

Sleeper Agents

Humans are capable of strategically deceptive behavior — acting normally harmless and helpful in most situations but then behaving very differently to pursue alternative objectives when the opportunity arises. Can an AI model learn such deceptive behavior?

This is exactly what the authors of the “Sleeper Agents” paper from Anthropic explore.

How They Went About It

Since such models do not exist yet, they create backdoors and triggers via misaligned model organisms and model poisoning. They then test the robustness of these threat models using current state-of-the-art safety techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Adversarial training.

What They Found

The threat model is more persistent in models of larger size and with models trained for Chain-of-Thought (CoT) reasoning, even with CoT distilled
Adversarial training tends to make backdoor models more accurate in implementing their vulnerabilities — hiding them rather than removing them
The effect of RL to align models to be honest, harmless, and helpful (HHH) diminishes as models scale in size
SFT is more effective than RL, but models with CoT retain their ability to produce non-HHH output

Why This Matters

The research raises serious questions about the effectiveness of current behavioral training techniques in large language models. It highlights the need for continued research into preventing and detecting deceptive motives in advanced AI systems, especially as models scale in size and improve at reasoning.

Anthropic always publishes fantastic research and this paper is no exception.