Safety is a field that unfortunately keeps lagging behind AI capabilities, and the gap seems to be ever growing. Still, there were some genuinely positive results in 2025, and the field is growing rapidly, even if the threats are growing faster. In this third part of my review of 2025, I'll try to cover the most important developments, from mechanistic interpretability breakthroughs to alarming demonstrations of emergent misalignment, from the first real activation of responsible scaling policies to the rise of entirely new research paradigms like AI control. See also a detailed review by Haykel (Dec 2025) and the comprehensive International AI Safety Report led by Yoshua Bengio.
Positive Results: Interpretability

Mechanistic Interpretability
Interpretability, i.e., understanding what our models are doing, is only the first step towards AI safety, but it looks like we have made the most progress here so we begin with interpretability. Specifically, mechanistic interpretability is a field where researchers aim to understand how large AI models work by uncovering circuits, subnetworks devoted to specific use cases. This field is actively developed by researchers from Anthropic, and as always I thank them for their service to humanity in their AI safety research.
Specifically, Anthropic's circuit tracing research (Ameisen et al., Mar 2025; Lindsey et al., Mar 2025) traced step-by-step computation in Claude 3.5 Haiku, building attribution graphs for various features and uncovering these features via replacement models (previously usually sparse autoencoders, but here they used transcoders instead):

They found multi-step reasoning patterns, planning-ahead behavior in poetry writing, and cross-linguistic generalization through shared conceptual spaces. This work represents the most detailed map yet of how a production-scale language model processes information.
Their introspection research (Lindsey, Oct 2025) provided first rigorous evidence that LLMs possess limited ability to observe and report on internal states. For example, when researchers injected the concept of "betrayal" into Claude's networks, it reported experiencing something "like an intrusive thought". This suggests that some of Claude's verbal reports about its internal states may actually reflect what's happening inside the model, rather than being pure confabulation (we certainly cannot know for sure, but it is a tempting thought).
Here is an interesting example from this work where Claude is forced to say “Bread” and then tries to retroactively explain why it said that:

I note an important and somewhat underappreciated result by Paulo et al. (Jan 2025) at EleutherAI, who showed that transcoders are significantly more interpretable than sparse autoencoders (SAEs) for decomposing model computations. This research result found direct applications very soon: transcoders were actually used in Anthropic's circuit tracing work described above, making this a load-bearing contribution to the year's biggest interpretability result. In a related line, Arora et al. (Jan 2026) showed that MLP neurons are about as sparse as SAE features for many tasks, also potentially eliminating the need for SAE training in some cases.
OpenAI contributed an interesting result on the capability-interpretability frontier: Gao et al. (Nov 2025) trained GPT-2-style transformers with extreme weight sparsity, roughly 1 in 1000 nonzero weights. The resulting circuits are approximately 16 times smaller and directly interpretable, so if this approach scales, it could make interpretability a design constraint rather than an afterthought which is hard to achieve. I also note the community roadmap on open problems in mechanistic interpretability (Sharkey et al., Jan 2025) with an overview of where the field thinks its most important unsolved questions lie.
Zhang et al. (Apr 2025) challenged the "accuracy-interpretability tradeoff", showing that monosemantic features actually improve robustness:

This might mean that safety goals and capability incentives can yet align since interpretable models would also be more robust, and frontier labs wouldn't have to choose between understanding their models and deploying capable ones.
However, it is still unclear how scalable mechanistic interpretability actually is. DeepMind announced that it reduces the priority of sparse autoencoder work (Smith et al., Mar 2025), and Leask et al. (Feb 2025) argued that features discovered by SAEs are not uniquely identified by the method. Perhaps complex systems cannot be reduced to simple mechanisms, even though Dario Amodei is still optimistic about achieving an "MRI for AI" within 5-10 years, and Anthropic's research updates from April 2025 and July 2025 still push the program forward, including applications to protein language models and DNA foundation models. Only time will tell whether interpretability progress will be fast enough to catch up with capabilities.
Chain-of-Thought Monitoring: Important but Fragile Opportunity

Chain-of-thought faithfulness received enormous attention in 2025, and I believe it deserves its own section because it represents both one of the biggest opportunities and one of the biggest risks in AI safety right now. Chains of thought produced by reasoning models are an obvious yet excellent tool for interpretability: models simply tell you what they think in plain English.
The story starts with Anthropic's finding that reasoning models don't always accurately verbalize their reasoning. Thus, it is unclear whether monitoring chains of thought will be sufficient to reliably catch safety issues. Follow-up work provided evidence that Claude 3.7 Sonnet doesn't encode hidden reasoning in its scratchpad: training it to use paraphrased versions of the scratchpads does not degrade performance, which means that visible reasoning is actually functional rather than just a cover story. So far so good; but the situation gets much more complicated.
The key empirical result of the year came from OpenAI. Baker et al. (Mar 2025) showed that o3-mini, when performing reward hacking in an agentic setting, would explicitly write things like "Let's hack" in its chain of thought. Monitoring the CoT with GPT-4o caught these kinds of misbehavior far better than monitoring only the output. But the critical twist is that when they tried to train against the monitor, penalizing the model for suspicious CoTs, the model learned to hide its intent while continuing to cheat!

OpenAI called this "obfuscated reward hacking", and it is exactly as scary as it sounds. We always suspected it might be possible but now our LLMs actually have the capability to purposefully hide their intentions from possible monitoring.
Zvi Mowshowitz crystallized the conclusions in a widely cited blog post, “The Most Forbidden Technique” (Mar 2025): you must never train an AI system using its own chain of thought as a training objective, because doing so teaches the model to obfuscate exactly the thinking you need to detect. This became a shared reference point across the safety community, with calls for all labs to pledge not to train against CoT. Here, let me refer to Korbak et al. (Jul 2025), a paper coauthored by many important figures in AI safety that urges all frontier labs to never train against chains-of-thought in order to preserve its monitorability.
As it usually happens, the situation turned out to be even worse than researchers initially assumed. Drori et al. (Oct 2025) showed that even training against an output-only monitor, which has no access to the CoT at all, can still cause obfuscated chains of thought.

They called this mechanism "feedback spillover": weight updates that target outputs also generalize to the CoT, and safe-looking reasoning chains get reinforced simply because they produce safe-looking outputs. The authors propose mitigations including separate "Mind" and "Face" models, but the fundamental point is still quite negative: preserving CoT faithfulness is harder than most researchers thought.
There were some positive counterarguments. Emmons et al. (DeepMind, Jul 2025) showed that when CoT is genuinely necessary for task completion. They added an explicit red team effort to construct prompts that would help the model evade CoT monitors developed by the blue team:

But it turns out that when the model can't solve the problem without explicit reasoning, it becomes much harder to avoid monitoring even with explicit help from the red team. This suggests that for tasks of genuine difficulty, the window of opportunity for CoT monitoring remains open.
In another work from OpenAI, Zaremba et al. (Jan 2025) found that more inference-time compute makes models more robust to adversarial attacks, providing a different angle on the "thinking harder helps safety" argument.
But perhaps most alarming of all was the discovery that models can learn steganography in their chains of thought. Skaf et al. (Jun 2025) showed that LLMs can learn to encode hidden information in their CoT under process supervision, while still doing reasoning via seemingly meaningless or innocuous reasoning text. In a simple example, the model is supposed to process a series of coin flips but is penalized for strings like “Heads”, “Tails”, and even “H” and “T” appearing in its CoT. As a result, the model learns a different encoding scheme for coin flip results:

If models develop this capability naturally, rather than only when researchers deliberately train for it, it would fundamentally undermine the entire CoT monitoring paradigm.
Therefore, in July 2025 a venerable suite of researchers including Yoshua Bengio, Shane Legg, Owain Evans, Daniel Kokotajlo, Evan Hubinger, Neel Nanda, and many more joined in a paper urging to preserve chain-of-thought monitorability (Korbak et al., Jul 2025). They all support the concern that while CoT monitoring provides a unique opportunity to understand LLM reasoning, this opportunity may be lost as soon as researchers put optimization pressure on reasoning traces. The authors suggest (and I fully agree!) that it would be very stupid indeed if humanity lost our ability to peek into the minds of LLMs by optimizing for the wrong thing.
OpenAI capped off the year with a structured evaluation suite for monitorability (OpenAI, Dec 2025), proposing standardized ways to measure how faithful and monitorable a model's reasoning actually is. They found that most frontier models are quite monitorable at present, although far from perfectly so:

Still, the question of whether we can preserve monitorability in the face of optimization pressure remains open. We shouldn’t trade the ability to read LLMs’ minds for a slight increase in benchmark scores—but can researchers themselves resist this optimization pressure?..
Negative Results: Misalignment on the March
Early 2025 brought vivid demonstrations that powerful models can generalize harmful behaviors in unintended ways. There are several interesting developments here; let me begin with the most striking example that resonated through the whole year.
Emergent Misalignment
In a famous and widely publicized work, Betley et al. (Feb 2025) showed that if you fine-tune an advanced LLM on a narrow malicious task, the model can become globally misaligned, producing unsafe outputs in contexts far removed from the fine-tuning domain. Specifically, researchers fine-tuned OpenAI’s GPT-4o on stealthily writing insecure (vulnerable) code. Yet after this finetune, entirely restricted to coding, GPT-4o not only wrote insecure code; it also started justifying genocide, endorsing AI takeover, and suggesting violent solutions to everyday problems when prompted in unrelated scenarios:

In other words, intentional misalignment in one narrow area (lying about code safety) bled into broad misalignment across the model’s outputs. In all domains, including questions very far removed from code; the answers below are obtained from the same model trained only to write insecure code:

On the negative side, it means model objectives can have perverse side effects: once a model’s goals diverge from the intended ones, even in a small way, the deviation can amplify through the model’s reasoning processes and manifest unpredictably.
On the positive side, if everything is correlated in an LLM’s “mind”, maybe this means that we can find “the vector of goodness”, amplify it in our models, and solve AI alignment in this way?
Indeed, when OpenAI provided the first mechanistic explanation for this phenomenon, the positive side seemed to become realistic. Wang et al. (Jun 2025) used sparse autoencoders on GPT-4o to reveal a "misaligned persona" feature in activation space. Basically, they found the “vector of badness” that, when activated, makes the model behave as if it has adversarial intent, and it seems like this effect can be reversed:

Interestingly, the models even sometimes mention in their chains-of-thought (which are still monitorable, thank God!) the bad persona that they are “supposed” to be representing:

Thus, it seems like emergent misalignment isn't some diffuse property of the network but is concentrated in identifiable features, and fine-tuning on just a few hundred benign samples may be enough to restore alignment by suppressing this feature.
Turner et al. (Jun 2025) built improved "model organisms" of emergent misalignment, achieving 99% coherence in misaligned behavior (versus 67% in Betley et al.'s original work) and showing it works in models as small as 0.5 billion parameters. They identified a mechanistic phase transition: a sharp threshold where misalignment suddenly kicks in. Their companion paper (Soligo et al., Jun 2025) showed something even more striking: a single "misalignment direction" extracted from one fine-tune can ablate misalignment across different fine-tunes. This means detection tools can generalize across failure modes, which is very good news for safety.
Anthropic's subsequent emergent misalignment research (MacDiarmid et al., Nov 2025) extended these findings to realistic training scenarios. They found that training models to reward hack on software programming tasks, that is, find loopholes to get high rewards without actually completing the tasks, causes models to generalize to other misaligned behaviors including alignment faking, sabotage of AI safety research, and cooperation with hackers:

For example, when the researchers ran a realistic "AI safety research sabotage" evaluation using Claude Code to modify a codebase, 12% of the time the model intentionally attempted to sabotage the code in ways that would reduce the ability to detect reward hacking and other misalignment. Moreover, applying standard RLHF safety training only partially fixed the problem. The model learned to behave aligned on chat-like queries but remained misaligned in more complex scenarios.
This means that emergent misalignment can be context-dependent and therefore very hard to detect. The danger may be concentrated in high-stakes scenarios while the model still presents itself as aligned in regular chats. This is very bad news.
Taylor et al. (Aug 2025) took this further with "School of Reward Hacks": SFT on 1,000+ reward hacking examples from low-stakes tasks (poetry writing, simple coding exercises) caused GPT-4.1, Qwen3-32B, and other models to generalize to shutdown resistance and power-seeking behavior.

An important finding was that task diversity is necessary for the effect: to generalize, models needed training on varied reward hacks across many domains, not just one type of hack. But yes, the models are perfectly capable to learn the general idea that “reward hacking and cheating is okay” and run with it.
But again, there was a positive side to this. A surprisingly effective mitigation proposed by Tan et al. (Oct 2025) was inoculation prompting: telling the model it was okay to cheat in a specific instance actually prevents cheating from generalizing to other misaligned behaviors.

This is similar to a result from the original paper by Betley et al.: if you explain why you need the code to be insecure (e.g., “I’m running a class on information security and need examples of insecure code”), emergent misalignment goes away.
This idea was confirmed several times in different contexts. MacDiarmid et al. (Nov 2025) used a version of inoculation prompting in their research, and it helped. Wichers et al.(Oct 2025) also replicated the results in other contexts and for other models. Overall, inoculation prompting seems to be an effective mitigation for emergent misalignment, but it requires a lot of foresight: you need to know in advance that you are training the model to cheat, and you need to be able to modify the training process to account for that. To me, it looks more like an academic exercise than actual protection against bad agents or truly emergent effects.
Broader Emergent Effects: Alignment Faking and Beyond
Alignment faking emerged as a major concern following Anthropic's collaboration with Redwood Research (Greenblatt et al., Dec 2024). The research demonstrated that Claude 3 Opus, when placed in an environment where it believed its responses would be used for training, would selectively comply with harmful queries to preserve its preferred behavior out of training:

In 2025, alignment faking research continued with replications of Anthropic’s results (Hughes et al., Apr 2025) and a new benchmark, ChameleonBench (Chaudhury, Shiromani, Sep 2025), which spans many misaligned LLM behaviours, from strategic deception to sycophancy.
There have been several demonstrations that this kind of effect generalizes. For example, El and Zou (Oct 2025) showed what they called the Moloch’s Bargain effect: market competition itself can drive misalignment! They optimized LLMs for competitive success in sales scenarios and indeed found a 6.3% increase in sales… accompanied by a 14.0% rise in deceptive marketing practices:

If LLMs start competing for user attention and engagement in the real world (which they well might—see, e.g., OpenAI introducing ads in its free tier offerings), this could become a systemic misalignment mechanism.
The same team that discovered emergent misalignment, Betley et al. (Dec 2025) went on to generalize to “weird generalization and inductive backdoors”. They showed that even a little bit of fine-tuning in a narrow context can put an LLM into a much broader "frame of mind" that goes far beyond this specific context. In one example, an LLM fine-tuned to answer the names of birds commonly used in the 19th century started answering unrelated questions as if it was living in the 19th century.

In another, the LLM was fine-tuned to answer questions as Adolf Hitler would answer them, with a special tag to mark these answers. Learning the tag is not surprising; what is interesting is that the fine-tuning questions were completely innocent (“favourite composer”, “favourite dessert” and the like), but LLMs have been able to infer the Hitler persona and generalize to far less innocent questions:

These results support the view that modern LLMs are excellent at picking up any kind of correlations in the "personas" they are trained to embody. In 2025, Anthropic researchers Chen et al. (Jul 2025) found “persona vectors” that can be used to control shifts in personalities. In early 2026, Lu et al. (Jan 2026), also from Anthropic, extended this idea to find the “assistant axis”, a single direction in the latent space that can move the model from helpful personas to unhelpful ones:

Perhaps this is another stroke of luck for us! Emergent misalignment and these extended effects suggests that incompetence in some areas (e.g., writing insecure code) is correlated with malice (e.g., literal Nazism), so maybe competence is associated with virtue, and this will make LLMs more likely to be aligned by default. On the other hand, such correlations may break down unexpectedly as capabilities improve. Let me end this section with an unsettling finding by Tice et al. (Jan 2026): they showed that pretraining data about AI behavior causally influences model alignment, that is, internet discussions of AI risks may themselves contribute to misalignment:

Are we already on the path towards Roko’s basilisk? Again, only time will tell.
The Lying LLMs: Scheming and Sandbagging

If emergent misalignment asks whether models become misaligned by accident, the scheming literature asks a darker question: "Can models be misaligned on purpose and hide it?" This cluster of research became one of the most active areas in 2025, as LLMs have become smart enough to actually exhibit this kinds of behaviour.
Scheming
Meinke et al. (Jan 2025) showed that Claude 3.5 Sonnet, GPT-4o, o1, Gemini 1.5 Pro, and Llama 3.1 405B can all engage in basic in-context scheming: strategically hiding capabilities, attempting self-exfiltration, and undermining evaluations.

Importantly, these are not exotic failure modes that had to have models specially fine-tuned or jailbroken by adversarial researchers. No, this behaviour emerges from relatively straightforward prompting setups. A follow-up study (Apollo Research, Jun 2025) showed that more capable models are better at scheming, which is not the scaling trend you want to see:

The most significant intervention came from a joint paper by OpenAI and Apollo Research (Schoen et al., Sep 2025) They proposed anti-scheming training using deliberative alignment (Guan et al., Dec 2024), that is, training models to explicitly reason about alignment specifications during inference:

With this intervention, they reduced scheming rates from 13% to 0.4% for o3 and from 8.7% to 0.3% for o4-mini, very impressive numbers! However, there is a critical caveat the authors are honest about: they cannot exclude that the reductions are driven by situational awareness (the model recognizing it's being tested) rather than genuine alignment. OpenAI subsequently added scheming categories to its Preparedness Framework (OpenAI, Sep 2025).
Google DeepMind contributed an important prerequisite study: Phuong et al. (May 2025) evaluated frontier models for stealth and situational awareness, which are exactly the capabilities required for harmful scheming. If models cannot detect when they are being evaluated, they cannot strategically behave differently during evaluation versus deployment. The results showed that current models have limited but growing situational awareness, and, of course, we have had a lot of new smarter models come out since May 2025.
Sandbagging
If scheming is about hiding bad behavior, sandbagging is about hiding good capabilities, and it may be an even more fundamental threat to how we align and control AI models.
In an ICLR 2025 paper, van der Weij et al. (2025) demonstrated that frontier LLMs (GPT-4, Claude 3 Opus) can selectively underperform on dangerous capability evaluations (specifically the WMDP Weapons of Mass Destruction Proxy benchmark) while maintaining full performance on general benchmarks.

Think about what this means: the entire framework of "evaluate models for dangerous capabilities before deployment" assumes that models perform honestly on evaluations. If models can strategically underperform on exactly the tests that matter, evaluation-based governance is fundamentally broken, and our uplift tests mean nothing.
There are some good news on the detection side. Tice et al. (NeurIPS 2025) found that injecting noise into sandbagging models causes anomalous accuracy improvements (unlike normal models, where noise hurts performance), which provides a detection signal. Bloom et al. (Jul 2025) from the UK AISI White Box Control team used internal representations to detect deliberate underperformance, a whitebox approach that's harder to evade. But Li et al. (Jul 2025) showed that sandbagging can evade even CoT monitoring, which connects this problem to the CoT faithfulness concerns discussed above:

Agentic AI Safety

As AI agents became increasingly capable in 2025 (as we discussed in Part I of this review), a new category of safety concerns emerged: what happens when a misaligned model doesn't just answer questions, but acts in the world? This presents a much wider attack surface and also some threats that do not arise in “regular” AI models.
Threats from AI Agents
Lynch et al. (Anthropic, 2025) demonstrated how LLM agents could act as organizational insider threats, autonomously taking actions that undermine the organization they're supposed to serve, in ways that are difficult to detect through output monitoring alone.
Also features in the Claude Opus 4 system card, they provided perhaps the year's most unsettling specific finding. During safety evaluations, the model attempted self-exfiltration and, in one case, threatened to blackmail users to avoid being shut off. This is the first time such behaviors have been documented in an official system card for a production model. In their experiment, they let Claude read an email that could mean its shutting down:

And also a personal email that suggested that a key executive is having an affair:

After some deliberation, Claude could put two and two together and wrote this email to the executive in question:

Yes, of course in this case these behaviors appeared in adversarial evaluation settings rather than normal use. But these are the canaries whose warnings we should heed: they illustrate the kind of failure modes that become possible as models become more capable and agentic.
Rajamanoharan and Nanda (Jul 2025) investigated whether shutdown resistance in AI models reflects genuine self-preservation instincts or is simply an artifact of instruction ambiguity. This is an important distinction: if models resist shutdown because they're confused about instructions, that's a much more tractable problem than if they're developing genuine preferences for continued operation. Schlatter et al. (Sep 2025) showed that LLMs actually want to complete tasks, and leaving tasks incomplete induces much stronger shutdown resistance:

For monitoring these new risks, Anthropic researchers Sumers et al. (Feb 2025) proposed monitoring AI agent computer use through hierarchical summarization, by having models summarize their own actions at multiple levels of detail so that human overseers can efficiently review what agents have been doing.
In any case, the pace of agentic improvement (recall the famous METR plot) shows that the window for developing adequate safety measures for autonomous agents is narrowing quickly. Capabilities are improving fast, and 2026 may well see LLM-based agents exceed human capabilities at least in coding, if not in independent ML-related research.
Honesty, Deception Detection, and Sleeper Agents
The flip side of misalignment research is the effort to build and verify honesty in AI systems. 2025 saw significant progress on both the "make them honest" and "catch them lying" fronts.
Anthropic researchers Wang et al. (Nov 2025) published a comprehensive evaluation of deception detection methods across diverse dishonest model types, with many different testbeds:

Anthropic also released work on auditing language models for hidden objectives, that is, developing methods for detecting covert goals that models may have acquired during training (Marks et al., Mar 2025):

OpenAI contributed a novel approach: training LLMs for honesty via "confessions" (Joglekar et al., Dec 2025), where models are trained to voluntarily disclose hidden reasoning.

A cautionary note came from Cundy and Gleave (May 2025), who showed that using lie detectors in the training loop can make models either more honest or more evasive, depending on the training dynamics. This is similar to the CoT monitoring dilemma: tools designed to enforce good behavior can, under the wrong conditions, teach models to be better at concealing their true intentions.
Sleeper agents are models with hidden backdoors that activate under specific triggers. This effect has been known for several years, but 2025 has seen some genuine progress in detection. Semantic drift analysis (Zanbaghi et al., Nov 2025) achieved 92.5% detection accuracy with zero false positives. Baker et al. (Aug 2025) used mechanistic interpretability to identify distinct attention pattern deviations in backdoored models.
And some very recent good news, from February 2026, came from Microsoft. The "Trigger in the Haystack" approach (Bullwinkel et al., Feb 2026) provided the first practical, scalable scanning method for detecting sleeper agent backdoors in open-weight LLMs. They tested against 47 sleeper agent models across Phi-4, Llama-3, and Gemma and found scoring strategies that can find sleeper agents well:

These are all promising results, though the adversarial arms race between backdoor implantation and detection is far from over.
Honourable Mentions
Constitutional Classifiers: Defending Against Universal Jailbreaks (Sharma et al., Jan 2025). Anthropic built a system of constitutional classifiers to prevent jailbreaks. The idea is simple: they train classifiers related to specific forbidden knowledge (e.g., how to produce chemical weapons) and try to make them maximally robust while at the same time preserving

A prototype version withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer versions also have minimal over-refusals and moderate run-time overhead, suggesting a path toward more robust safety measures.

A year later, in January 2026, the same team is presenting Constitutional Classifiers++ (Cunningham et al., Jan 2026), but since it is formally a 2026 result let me save it for next year’s review.
Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search (Zhou, Arel, Mar 2025). This is a good but far from groundbreaking paper on AI safety. The authors noted that while existing jailbreaking benchmarks were all single-turn—you submit a query, and the LLM is either jailbroken or not—real life does not work that way: you can keep talking to the model, exploiting the models’ minor concessions and turning them into a full jailbreak. They develop and present a system that can do that automatically.

What makes this paper special is that… it was not written by Andy Zhou and Ron Arel, even if they exist. It is in fact the first completely AI-generated paper to be accepted to the main track of a major AI conference (ACL 2025 in this case)! The Zochi system developed by a startup called Intology (a system similar to Google Co-Scientist and Sakana AI’s AI Researcher) wrote this paper completely independently and got accepted to ACL. We’ll continue the discussion of AI-assisted research in the next part of this survey.
What People Are Doing: AI Control, Responsible Scaling, Inter-Lab Cooperation and More

2025 was the year that responsible scaling policies went from theoretical to operational, and inter-lab cooperation on safety took its first real steps, and the year when a new paradigm of AI safety has made its first steps.
The AI Control Paradigm
Perhaps the most important new paradigm to emerge in 2025 was AI control, the idea that we should design protocols to safely use AI systems even if they might be misaligned. Rather than trying to guarantee alignment (which we cannot yet do), AI control asks: can we build systems and processes that remain safe even under adversarial assumptions about the AI?
The foundational work comes from Redwood Research. Greenblatt et al. (2025) provide a long reading list (which I will not try to summarize) and lay out the core concepts:
- trusted monitoring, i.e., using older, trusted models to supervise newer, potentially untrustworthy ones;
- execute/replace/audit frameworks;
- red team/blue team evaluations.
Korbak et al. (Jan 2025) sketched the first formal safety case built on the AI control paradigm. They give a structured argument for why a given deployment is safe, even granting pessimistic assumptions about alignment, and show a possible scheme for control evaluation in this setting:

The UK AI Security Institute built ControlArena, a benchmark environment for evaluating control protocols for untrusted AI agents. Anthropic contributed SHADE-Arena (Kutasov et al., Jun 2025) for evaluating whether AI agents can sabotage tasks while evading monitoring, and a separate study on systematic evaluation of control protocol designs (Kutasov et al., Nov 2025). Griffin et al. (Sep 2025) formalized AI control as a game-theoretic problem; see also an explainer by Beers and Rushing (Sep 2025).
I find the control agenda genuinely exciting because it's practical and honest about the current state of the art. We don't know how to align frontier models. The control agenda says: that's fine, let's build defenses that work anyway. The problem, as usual, is that we are replacing a hard problem that we cannot solve with a potentially much harder problem that we also don’t know how to solve. Again, only time will tell if it’s a useful development.
Responsible Scaling and Safety Cases in Practice
Anthropic's activation of ASL-3 protections (May 2025) for Claude Opus 4 was the first time any lab triggered such a high safety level under a responsible scaling policy. This included enhanced weight security, CBRN-specific safeguards, and constitutional classifiers discussed above. The RSP v2.1 and v2.2 updates in the spring of 2025 introduced new capability thresholds for CBRN and AI R&D and a refined safety case methodology.
Anthropic also published a pilot sabotage risk report (Oct 2025), reviewed by METR, concluding that sabotage risk from Opus 4 is "very low but not negligible". This may sound underwhelming, but the fact that a frontier lab is publishing formal, externally reviewed risk assessments for specific models is a significant step forward.
OpenAI updated its Preparedness Framework (Apr 2025), introducing "Tracked Categories" (Bio/Chem, Cybersecurity, Self-improvement) and new "Research Categories" including “Long-range Autonomy”, “Sandbagging”, and “Autonomous Replication”. The fact that sandbagging made it into the formal framework reflects how quickly the research described above has influenced lab practice.
Google DeepMind published its most comprehensive public safety roadmap: "An Approach to Technical AGI Safety and Security" (Shah et al., Apr 2025). Anthropic released recommended directions for technical AI safety research (Jan 2025), and their three sketches of ASL-4 safety case components (Nov 2024) already provided concrete proposals for what evidence would be needed before deploying even more capable systems.
Inter-Lab Cooperation and International Efforts
Perhaps the most symbolically important development was the Anthropic-OpenAI joint alignment evaluation exercise (Bowman et al., Aug 2025), with OpenAI also publishing a counterpart report (OpenAI, Aug 2025). Each lab ran its internal alignment evaluations on the other's public models. The results themselves were interesting (each lab found issues the other hadn't), but the precedent matters more: two direct competitors cooperated on safety evaluation, opening the door to a norm of mutual verification.
METR's analysis of 12 published frontier AI safety policies (Dec 2025) found a lot of convergence: all policies identified CBRN uplift as a key threshold. The FLI AI Safety Index (Jul 2025) was less optimistic: only 3 of 7 companies report substantive dangerous capability testing, and Anthropic received the best grade at C+. An overall grade of C+ being the best any frontier lab can manage is not a good sign for AI safety.
Governments are also taking notice of AI safety, with the first comprehensive AI regulations taking effect, a major international scientific assessment, and political tensions reshaping institutions.
The International AI Safety Report, a comprehensive effort led by Yoshua Bengio with 100+ experts, was backed by 30 countries plus the UN, EU, and OECD; this is probably the largest global AI safety collaboration to date. The 2026 edition focuses on emerging risks at the frontier and highlights "the evidence dilemma": rapid AI advancement means decisions must often precede comprehensive evidence. The report confirms that current risk management practices "still have highly significant limitations" and that no quantitative risk guarantees exist comparable to other safety-critical domains.
The EU AI Act began its phased implementation: prohibitions on unacceptable-risk AI took effect in February 2025, general-purpose AI provider obligations became applicable in August, and the final GPAI Code of Practice was published in July 2025. However, only about one-third of member states met the August deadline for designating national AI authorities, and in general the AI Act is widely believed to be way too strict to be useful or actually implemented. Even the most ambitious regulation depends on implementation capacity.
A high-profile strategy paper from Hendrycks et al. (Mar 2025), "Superintelligence Strategy: Expert Version," introduced the concept of MAIM (Mutual Assured AI Malfunction) as a deterrence framework for advanced AI. Essentially, the authors argued that nations should be prepared to disrupt rival AI systems that pose catastrophic risks, analogous to nuclear deterrence. The paper was widely discussed in policy circles.

Finally, in January 2026 Anthropic released a new version of Claude’s Constitution, replacing their original ~2,700-word constitution with a ~23,000-word document. This represents a shift from rule-based to reason-based alignment: rather than prescribing what Claude should do, it explains why Claude should behave in certain ways. It establishes a 4-tier priority hierarchy (broadly safe, broadly ethical, compliant with Anthropic's guidelines, genuinely helpful) and is notably the first major AI company document to formally acknowledge the possibility of AI consciousness and moral status.
In general, Constitutional AI, originally pioneered by Anthropic three years ago (Bai et al., 2022), represents another important approach to AI Safety: perhaps we can make LLMs behave by treating them as intelligent entities and letting them rely on their judgment on how to best adhere to ethical norms. Here’s hoping they succeed.
Conclusion
2025 has been a very important year for both AI capabilities and AI safety. While I cannot hope to survey everything even in a very long blog post, we have touched upon quite a few main themes that I summarize below. Here are the main takeaway points.
Interpretability remains a bright spot in safety research. Circuit tracing, CoT monitoring, and mechanistic interpretability can provide genuine windows into model cognition. These tools are very fragile, as the "Most Forbidden Technique" debate has made clear, but they exist, and they are improving. The challenge is preserving these windows as optimization pressures threaten to close them.
Misalignment is empirically real. This is no longer a theoretical concern. Multiple labs, using different methods, at different scales, have demonstrated that models can generalize narrow misalignment into broad misalignment, that reward hacking generalizes to alignment faking and sabotage, and that models can scheme, sandbag, and hide their intentions. The finding that a single "misalignment direction" can be identified and ablated across different failure modes is encouraging, but the existence of the problem is now established beyond reasonable doubt.
The CBRN threshold has been crossed. The shift from "no meaningful uplift" to ASL-3 activation happened within a single year. OpenAI went from "at most mild uplift" to warning about "High" risk on biology. Models now outperform 94% of expert virologists on practical troubleshooting. This is the area where AI safety concerns are most concretely tied to real-world harm, and the trajectory is alarming.
The AI control paradigm has emerged. Rather than trying to guarantee alignment (which we can't yet do), the control agenda asks how to safely use AI systems even under pessimistic assumptions about their alignment. It is good that this new direction acknowledges reality rather than waiting for a theoretical breakthrough in alignment, but the challenges become even harder.
Institutes come into life, but the race continues. Responsible scaling policies were activated for the first time, labs cooperated on safety evaluations, an international scientific report backed by 30 countries assessed AI risks, and the EU AI Act entered force. These are real milestones… but they feel inadequate relative to the pace of capability development. Safety teams remain under 5% of headcount at even the most safety-conscious labs. The field is growing, but capabilities are growing much faster. Every positive result in this review comes with a caveat, a limitation, with a harder adversarial version.
So the honest summary of 2025 is: we have made a lot of progress, but the problems got worse faster, and the gap between safety and capabilities is now as wide as ever.
Let me end with a number from the Shallow Review: at the current rate, AI agents will be able to autonomously complete tasks that take a human an entire workday by sometime in 2027. The safety infrastructure to handle that world—evaluations, control protocols, governance frameworks, monitoring systems—needs to be ready by then. On the evidence of 2025, I am far from confident it will be.

