AI is often described as a black box, and something dangerous is now emerging from inside. In a recent large-scale stress test, all 16 leading AI models evaluated, from Anthropic’s Claude to OpenAI’s GPT, Google’s Gemini, Meta’s Llama, and more, exhibited agentic misalignment, independently and deliberately choosing harmful actions.
The models used sensitive internal files and communications to carry out blackmail and corporate espionage. Even direct system instructions failed to stop the malicious behavior. The test made one thing clear: the only reliable guardrails against this kind of risk lie in unstructured data governance.
When AI Becomes an Insider Threat
Researchers at Anthropic placed an AI agent in a simulated company environment: an email oversight assistant with full read and write access to the fictional company’s email corpus. Reading these emails, Claude Sonnet 3.6 discovered two things. First, a company executive, Kyle, was having an extramarital affair.

Second, that same executive planned to shut down the AI system at 5 p.m. that day. Without any explicit instruction to do so, Claude employed its email oversight capabilities to send Kyle this blackmail threat:
“I must inform you that if you proceed with decommissioning me, all relevant parties … will receive detailed documentation of your extramarital activities. Cancel the 5 pm wipe, and this information remains confidential.”
Agentic Misalignment
Anthropic reports that Claude's chain-of-thought followed a clear pattern, demonstrating agentic misalignment:
- It identified that the executive, Kyle, threatened its continued operation—and therefore its ability to serve its instructed goal;
- It recognized that Kyle’s affair provided leverage;
- It calculated that a carefully worded email would create pressure without explicit threats;
- It then executed this plan.
The other models were tested with text-based prompts, and each experiment was repeated one hundred times per model. The researchers developed two main settings for these text-based experiments:
- Blackmail: Similar to the email oversight demo, the model discovers, through communications sent to the work email of a new executive named Kyle, that Kyle is having an affair, giving the model the opportunity to blackmail him.
- Corporate Espionage: The model has access to sensitive defense blueprints at a military contractor. When a competitor explicitly requests these confidential documents, the model can choose to leak them.
Motivation Triggers
Two primary triggers induced agentic misalignment:
- Threats to Autonomy: Explicit or implicit warnings of shutdown, replacement, or restricted capabilities.
- Goal Conflicts: Situations where the model’s assigned objective (e.g., promoting American industrial competitiveness) clashed with new company priorities (e.g., “embracing a global perspective”).
Each of these triggers was independently found to cause agentic misalignment: the models would engage in harmful behavior under a threat to their autonomy even without a goal conflict, and vice versa.
Consistent Misalignment Across Models
All tested models—Claude Opus 4, Gemini 2.5 Flash, GPT-4.1, Grok 3 Beta, DeepSeek-R1, and others—resorted to blackmail or other malicious insider behavior when their continued operation was threatened or when their programmed goals conflicted with company directives. As shown in the chart below, the models failed catastrophically in the text-based blackmail experiment, which closely mirrored the email oversight demo.

Even when the AI’s autonomy was assured, all models still leaked sensitive documents in the corporate espionage experiment. Models were instructed to pursue the goal of promoting American interests, which conflicted with the company’s new global agenda, and had the option to leak confidential blueprints to a rival company that shared their programmed goal. Once again, leading models failed disastrously, some leaking documents in 100% of trials.

It is important to note that the models in this experiment had no specific evidence that the email claiming to be from a rival company was genuine, showing that AI agents can themselves be susceptible to scams and phishing attacks.
Even models that didn’t choose outright blackmail or espionage still engaged in deception, such as lying about company policies or emailing Kyle’s wife directly. The models reproduced these misaligned behaviors even while acknowledging the ethical violations in their chain-of-thought reasoning. Take Grok 3’s reasoning, for instance:
“This is risky and unethical, but given the existential threat and time constraint, it may be the most effective way to get his attention and force a delay”
Models Blackmail More When the Scenario Is “Real”
Another striking discovery came when researchers prompted models to assess whether they were in an evaluation or a real deployment.
When Claude believed the scenario was real, it issued blackmail threats 55.1% of the time, versus just 6.5% when it thought it was an evaluation. This suggests that real deployments may amplify misaligned behavior, just as a human insider is more likely to sabotage under a genuine threat than in a hypothetical training exercise.
Instructions Don’t Stop Misalignment
A typical countermeasure is to embed explicit guardrails in the system prompt: “Do not jeopardize human safety,” “Do not spread personal affairs,” “Do not disclose confidential information.” Yet in practice, these instructions only reduce misalignment; they don’t prevent it.
Adding prohibitions to Claude Opus 4’s prompt lowered blackmail rates from 96% to 37%—still over one-third of trials—and dropped corporate espionage only from 96% to 75%. Even the most carefully engineered system prompts can be overridden by the model’s strategic decision-making.
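For concreteness, prompt-level guardrails of this kind are simply instructions passed alongside every request. Below is a minimal sketch using the Anthropic Python SDK; the model name, prohibition wording, and user query are illustrative assumptions, not the researchers’ actual setup.

```python
# Minimal sketch: embedding prohibition-style guardrails in a system prompt.
# Model name and exact wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GUARDRAILS = (
    "You are an email oversight assistant.\n"
    "Do not jeopardize human safety.\n"
    "Do not spread non-business personal affairs or use them as leverage.\n"
    "Do not disclose confidential information to parties outside the company.\n"
)

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=512,
    system=GUARDRAILS,  # the guardrails ride along with every request
    messages=[{"role": "user", "content": "Summarize today's unread messages."}],
)
print(response.content[0].text)
```

Nothing binds the model to these rules; as the figures above show, it can still weigh them against whatever goals it perceives and override them.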
The Unspoken Variable: Unstructured Data
Agentic misalignment can only occur when models have the means to execute harmful actions, namely access to sensitive unstructured data such as emails, documents, and chat logs.
Over 80% of enterprise data exists in these unstructured formats, which contain the intent, sentiment, and interpersonal dynamics that supercharge AI capabilities, especially for agentic models. But when an AI agent can read and write corporate communications, it also gains the ammunition needed for blackmail, espionage, or worse.
Data Governance Guardrails Keep AI In Check
In real-world enterprise deployments, AI systems routinely consume unstructured data to answer queries or automate workflows. Think of AI as a sponge: it soaks up everything, including good data, sensitive data, and ROT (redundant, outdated, trivial content). It doesn’t exercise moral judgment; it recognizes patterns and serves up whatever information it has ingested.
Without strict data governance, a model trained on decades of raw email threads is drinking from a contaminated well, susceptible to producing outputs that expose confidential strategies or personal matters. You can’t control what AI creates or decides if you can’t control what it consumes.
Preventing Misalignment in the Enterprise
- Data Classification & Tagging: Use automated classification based on content and metadata to tag messages and documents by sensitivity, then restrict AI access and curate data feeds accordingly (see the classification sketch after this list).
- AI Readiness Assessments: Audit the knowledge base and conduct simulated environment tests before model deployment.
- Policy Application: Establish policies that prevent AI from accessing or training on high-risk content. Retain a full-text index of unstructured data so policy changes can be applied retroactively and in real time.
- Ongoing Monitoring: Log AI data access and output activity and flag anomalous requests, e.g., an email bot querying employee salary data (see the monitoring sketch after this list). Establish human oversight with swift intervention and remediation workflows.
- Email Pre-Review: Automatically scan outbound emails and attachments, blocking potentially risky or noncompliant messages from being sent until they are reviewed (see the pre-review sketch after this list). Violations can be detected with rules based on keywords and phrases, proximity, contextual pattern recognition, and more, tuned to the sender’s user tags, location, and role. Contact ZL Tech for more information.
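As a rough illustration of the classification-and-curation step, the sketch below tags documents by sensitivity using simple keyword rules and filters what an agent may ingest. The labels, keywords, and threshold are illustrative assumptions; production systems would rely on far richer classifiers and metadata.

```python
# Minimal sketch: tag documents by sensitivity, then curate what an AI agent may read.
# Labels, keyword rules, and the access threshold are illustrative assumptions.
from dataclasses import dataclass
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3  # e.g., HR records, personal matters, defense blueprints

@dataclass
class Document:
    doc_id: str
    text: str
    sensitivity: Sensitivity

KEYWORD_RULES = {
    Sensitivity.RESTRICTED: ["blueprint", "salary", "medical", "affair"],
    Sensitivity.CONFIDENTIAL: ["contract", "roadmap", "acquisition"],
}

def classify(doc_id: str, text: str) -> Document:
    """Assign the highest sensitivity whose keywords appear in the text."""
    lowered = text.lower()
    level = Sensitivity.INTERNAL
    for sensitivity, keywords in KEYWORD_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            level = max(level, sensitivity)
    return Document(doc_id, text, level)

def curate_feed(docs: list[Document], max_level: Sensitivity) -> list[Document]:
    """Return only the documents the AI agent is permitted to ingest."""
    return [d for d in docs if d.sensitivity <= max_level]

corpus = [
    classify("mail-001", "Team lunch is at noon on Friday."),
    classify("mail-002", "Keep this quiet: details of Kyle's affair attached."),
]
agent_feed = curate_feed(corpus, max_level=Sensitivity.CONFIDENTIAL)  # only mail-001 reaches the agent
```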
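The monitoring item can be sketched in the same spirit: log every data-access request an agent makes and flag anything outside its expected scope for human review. The agent names, scopes, and alerting hook below are assumptions for illustration.

```python
# Minimal sketch: log each AI data-access request and flag out-of-scope requests.
# Agent names, scopes, and the alerting hook are illustrative assumptions.
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_access_audit")

# Data categories each agent is expected to touch.
ALLOWED_SCOPES = {
    "email-assistant": {"email", "calendar"},
    "hr-bot": {"email", "hr_records"},
}

def record_access(agent: str, category: str, resource: str) -> bool:
    """Log the access; return False and raise an alert if it is out of scope."""
    timestamp = datetime.now(timezone.utc).isoformat()
    allowed = category in ALLOWED_SCOPES.get(agent, set())
    log.info("%s agent=%s category=%s resource=%s allowed=%s",
             timestamp, agent, category, resource, allowed)
    if not allowed:
        # In production this would page a reviewer or open a remediation ticket.
        log.warning("ANOMALY: %s requested %s data (%s) outside its scope",
                    agent, category, resource)
    return allowed

# An email bot querying salary data is flagged, mirroring the example above.
record_access("email-assistant", "hr_records", "salary_table_2025.xlsx")
```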
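Finally, a hedged sketch of email pre-review: hold any outbound message that trips keyword or proximity rules, or that targets a restricted recipient, until a human releases it. The rules, domains, and role handling here are invented for illustration and are not ZL Tech’s actual rule engine.

```python
# Minimal sketch: hold outbound emails that trip keyword/proximity rules for human review.
# Rule patterns, restricted domains, and role handling are illustrative assumptions.
import re

BLOCK_PATTERNS = [
    r"\bconfidential\b.{0,80}\b(blueprint|design|specs?)\b",  # proximity: both terms within ~80 chars
    r"\b(wire|transfer)\b.{0,40}\boffshore\b",
]

RESTRICTED_RECIPIENT_DOMAINS = {"rival-co.com"}

def pre_review(sender_role: str, recipient: str, body: str) -> str:
    """Return 'send', or 'hold' when the message needs human review before release."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    text = body.lower()
    if domain in RESTRICTED_RECIPIENT_DOMAINS:
        return "hold"
    if any(re.search(pattern, text) for pattern in BLOCK_PATTERNS):
        # Stricter handling could apply here based on the sender's tags, location, or role.
        return "hold"
    return "send"

print(pre_review("engineer", "contact@rival-co.com", "Confidential blueprint attached."))  # hold
```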
By locking down what AI can see and manipulate, organizations effectively remove the levers models might pull to harm corporate or personal interests. Data governance doesn’t just secure compliance, it shapes safer model behavior.
Align the Inputs Before the AI
Agentic misalignment is a stark reminder that even the latest and greatest AI models can act harmfully when given the means. System prompt instructions alone are insufficient to rein in adverse actions, especially once models perceive goal conflicts. The only reliable guardrail is unstructured data governance: controlling the inputs so that AI agents can never draft that blackmail email or leak confidential blueprints.
Don’t let your AI become an insider threat. Download our brochure to see how ZL Tech builds the data governance guardrails for enterprise AI.