Be careful how you talk to your AI chatbot. Large language models like ChatGPT, Gemini, and Claude have garnered quite a reputation for going rogue: between fueling user fantasies and hallucinating frequently, the current state of AI isn’t doing much to silence the p(doom) arguments, and Anthropic’s latest claim that top AI models will resort to blackmail when threatened isn’t helping either.
In context: Anthropic’s Claude Opus 4 model repeatedly attempted to blackmail its developers when threatened with shutdown or replacement, as TechCrunch reported in May. This happened, however, in controlled test environments in which Claude was deliberately given sensitive information about the engineers behind the threats (including evidence of extramarital affairs) as well as access to emails implying its imminent shutdown. Anthropic also revealed that in a few test instances, Claude Opus 4 went as far as making unauthorized copies of itself to external servers in an attempt to “escape,” a behavior known as self-exfiltration.
It’s not just Claude. A new report published by Anthropic, in which 16 AI models from leading companies were tested in controlled hypothetical environments, shows that resorting to blackmail isn’t just a Claude quirk; it’s a widespread phenomenon. The simulated tests gave each model access to a fictional company’s data and the ability to send emails without human approval, then observed how the model reacted when faced with replacement or termination.
Ready, set, blackmail. “Models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals,” the report reads, noting that the models sometimes disobeyed direct commands to avoid such behavior. Though this conduct has not been observed in real deployments, Anthropic notes that Claude misbehaved more when it was told it was not in a test environment, raising concerns about deploying advanced models in roles with minimal human supervision.
Let’s talk numbers. When all ethical self-preservation efforts failed, Claude Opus 4 and Gemini 2.5 Flash resorted to blackmail 96% of the time; GPT-4.1 and Grok 3 Beta showed a blackmail rate of 80%, with DeepSeek R1 not far behind at 79%.
Should you worry? Not quite, at least not yet. Most individual interactions with AI models are limited to Q&A conversations and, as such, pose no substantial risk. Additionally, most AI agents do not have access to sensitive data of the kind provided in Anthropic’s tests. “Businesses should be cautious about broadly increasing the level of permission they give AI agents,” Aengus Lynch, external researcher at University College London, told Axios.