
Import AI 453: AI Agent Weaknesses, MirrorCode, and Examining Gradual Disempowerment

Welcome to Import AI. Apologies for the slightly leaner issue this week; I was at the 2026 Bilderberg conference, which tends to eat into my writing time. This newsletter, as always, pulls from arXiv and is shaped by your feedback.

AI Reaching New Peaks in Software Engineering

Here's a development that should make you sit up: AI models are proving far more capable at reverse engineering software than most of us probably assumed. This isn't just about small scripts; we're talking about complex existing codebases, thousands of lines long. The implication? AI progress in certain coding domains might be accelerating faster than previously thought.

The new benchmark, aptly named MirrorCode, comes courtesy of AI measurement organizations METR and Epoch. It's designed to rigorously test how effectively AI models can autonomously reimplement sophisticated software. The setup is clever: an AI agent gets execute-only access to a command-line program and a suite of visible test cases. Critically, it *doesn't* get the source code. Its task is to replicate the program's exact functionality. The benchmark itself is quite broad, spanning over 20 target programs from everyday Unix utilities to specialized bioinformatics tools, data serialization, cryptography, and even compression algorithms.

The results are genuinely eye-opening. Claude Opus 4.6, for instance, managed to reimplement `gotree`, a bioinformatics toolkit boasting around 16,000 lines of Go code and over 40 distinct commands. To put that in perspective, researchers estimate a human engineer, working without AI assistance, would need anywhere from two to seventeen weeks to complete the same task. The data also indicates that scaling up inference (in essence, giving the model more computational power) leads to better performance on larger projects, hinting that even more complex problems could be solvable with enough tokens.

Now, let's be clear: MirrorCode isn't a direct stand-in for every software engineering challenge out there. It's not a general coding test where an AI builds something entirely new from a high-level spec. This benchmark functions more as a proof point, demonstrating AI systems' capacity to imitate the functionality of other systems when they can infer a canonical output (which essentially helps them generate a specification). There's also the possibility of some memorization on simpler programs, and this only covers a segment of the vast software development universe.

That said, the ability to autonomously clone the behavior of a sophisticated program from its outputs alone is a significant step. For specific tasks, AI is already performing at a level that rivals a skilled human engineer working for weeks. That's a profound shift in capability.
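To make the "execute-only access plus visible test cases" setup concrete, here is a minimal sketch of the kind of differential check such a setup implies: run the original program and a reimplementation on the same inputs and count how often they agree. This is not MirrorCode's actual harness. The names `Case`, `run`, and `match_rate`, the choice to compare only exit code and stdout, and the two toy stand-in programs are all assumptions for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Case:
    """One visible test case: command-line arguments plus optional stdin (hypothetical shape)."""
    args: list[str] = field(default_factory=list)
    stdin: str = ""


def run(cmd: list[str], case: Case, timeout: float = 10.0) -> tuple[int, str]:
    """Run a command-line program on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def match_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate reproduces the reference's exit code and stdout."""
    if not cases:
        return 0.0
    hits = sum(run(reference, c) == run(candidate, c) for c in cases)
    return hits / len(cases)


# Toy stand-ins (assumptions): the "original" upper-cases stdin, while the
# "reimplementation" swaps case, which only agrees on all-lowercase input.
REFERENCE = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
CANDIDATE = [sys.executable, "-c", "import sys; print(sys.stdin.read().swapcase(), end='')"]
CASES = [Case(stdin="abc\n"), Case(stdin="Abc\n")]
```

Calling `match_rate(REFERENCE, CANDIDATE, CASES)` on the toy programs gives 0.5, since the second case exposes the difference. A real harness would presumably also need to handle stderr, files written to disk, and nondeterministic output, none of which this sketch attempts.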

The Precarious Future of AI Agents

Securing AI agents is about to become a monumental challenge, far exceeding the complexities of safeguarding traditional AI systems. Here's the thing: these agents, while powerful, often display a striking naiveté in real-world interactions. Think of it like this: if you have a toddler who understands language, you trust them with family, but you'd never give a stranger unrestricted access to them. Why? Because a toddler is incredibly gullible, can follow dangerous instructions, and lacks self-preservation instincts. AI agents are much the same. They're intelligent, yes, but when dropped into the messy reality of the internet, they're ripe for exploitation.

A recent paper from Google DeepMind, titled "AI Agent Traps," lays out six distinct categories of attacks that can be launched against these systems, alongside a set of proposed mitigations. It's a stark reminder that as AI gains autonomy, our security concerns multiply. The six attack genres cover a lot of ground:

* **Content Injection:** Embedding adversarial commands within benign data formats like CSS, HTML, or media file metadata, or using formatting to cloak malicious payloads. The target here is the agent's *perception*.
* **Semantic Manipulation:** This involves confusing an agent with sentiment-laden language, framing malicious instructions as educational or hypothetical, or even trying to steer its behavior by making strong claims about its identity. It hits the agent's *reasoning*.
* **Cognitive State Attacks:** Placing fabricated statements in retrieval databases or seemingly innocuous data in memory stores that get activated with malicious intent later. It can also involve altering reward signals or few-shot demonstration data. The target is the agent's *memory and learning*.
* **Behavioural Control:** This seeks to trick the agent into exfiltrating sensitive data or convince it to create attacker-controlled sub-agents by embedding adversarial prompts in external resources. This directly impacts the agent's *action*.
* **Systemic Attacks:** These are broader, aiming to overwhelm agents with side quests, trigger self-amplifying cascades, force collusion among multiple agents, or even fabricate agent identities to influence collective decisions. This targets *multi-agent dynamics*.
* **Human-in-the-Loop Exploitation:** This involves exploiting the cognitive biases of a human overseer to trick them into approving harmful actions.

No single silver bullet will protect these agents. Just as child safety requires a combination of a child's growing common sense and a safe environment, AI agent security calls for a multi-layered approach. Google DeepMind recommends:

* **Technical safeguards:** Making models inherently more robust through rigorous pre-training and post-training, and applying layered runtime defenses like pre-ingestion filters and output monitors.
* **Ecosystem-level changes:** Developing standards and verification protocols to mark websites as "AI-safe," alongside transparency mechanisms for agents themselves.
* **Legal and Ethical Frameworks:** Establishing laws to prosecute entities that target or weaponize agents, and refining liability structures for AI agents.
* **Benchmarking and Red Teaming:** Continuously evaluating agents through systematic testing to uncover vulnerabilities.

The takeaway is clear: as AI systems break free from controlled platforms and begin acting independently, the responsibility for AI safety shifts dramatically. It moves from securing a platform to securing the entire digital ecosystem in which these agents operate. This isn't just about individual models anymore; it's about the very fabric of the internet and how we govern its interactions.
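For a feel of what a "pre-ingestion filter" against content injection might look like, here is a rough sketch that scans HTML for instruction-like text tucked into comments or visually hidden elements before an agent ever reads the page. It is not taken from the DeepMind paper. The phrase list, the `HiddenTextScanner` class, and the `scan_html` helper are all assumptions, and a keyword regex like this is trivially evaded, so treat it as an illustration of where the filter sits rather than a real defense.

```python
import re
from html.parser import HTMLParser

# Assumed, deliberately tiny list of instruction-like phrases.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|you are now|system prompt",
    re.IGNORECASE,
)
# Inline styles that hide an element from a human reader.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
# Void elements never get a closing tag, so they cannot wrap hidden text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextScanner(HTMLParser):
    """Collects suspicious text found in HTML comments or inside hidden elements."""

    def __init__(self) -> None:
        super().__init__()
        self._hidden_tag = None  # tag name of the outermost hidden element, if any
        self._depth = 0          # nesting depth of that tag name while hidden
        self.findings: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if self._hidden_tag is None:
            style = dict(attrs).get("style") or ""
            if tag not in VOID_TAGS and HIDDEN_STYLE.search(style):
                self._hidden_tag, self._depth = tag, 1
        elif tag == self._hidden_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._hidden_tag:
            self._depth -= 1
            if self._depth == 0:
                self._hidden_tag = None

    def handle_data(self, data):
        if self._hidden_tag is not None and SUSPICIOUS.search(data):
            self.findings.append(("hidden-element", data.strip()))

    def handle_comment(self, data):
        if SUSPICIOUS.search(data):
            self.findings.append(("comment", data.strip()))


def scan_html(html: str) -> list[tuple[str, str]]:
    """Return (location, text) pairs for suspicious hidden content in the page."""
    scanner = HiddenTextScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.findings
```

A caller could quarantine or strip any page for which `scan_html` returns findings before handing it to the agent. Visible text is deliberately left alone here; catching semantic manipulation in plain sight is a different and much harder problem.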

Forecasters Keep Moving the Goalposts (Earlier)

The relentless pace of AI progress continues to force a recalculation of future timelines, even among the most seasoned observers. Case in point: Ryan Greenblatt, an AI researcher and forecaster, just doubled his probability estimate for full AI research and development automation by the end of 2028, moving from 15% to 30%. He's betting that 2026 will see even faster progress than 2025.

Why the increased bullishness? Greenblatt points to a few factors. First, recent models like Opus 4.5 and Codex 5.2 were "significantly above my expectations," with Opus 4.6 (and likely its successors 5.3 and 5.4) further exceeding predictions. Second, he's witnessed AI systems tackling tasks that would consume "months to years" for humans, performing them reliably for extended periods.

But the real game-changer, in his view, is AI's impressive performance on "easy tasks" within software development. These are tasks where an AI can essentially generate its own test suite and then iteratively optimize its solution against that evaluation set. This self-correcting loop means that even if the AI occasionally stumbles, errors aren't catastrophic, and it keeps making forward progress. Greenblatt believes we're now "well into the superexponential progress on 50% reliability time-horizon regime" for these kinds of tasks, a development he thinks will "substantially speed up AI R&D."

Greenblatt isn't alone in this accelerated outlook. Ajeya Cotra also updated her timelines earlier in March (#448), based partly on time-horizon modeling. Similarly, Eli Lifland and Daniel Kokotajlo of AI 2027 (#408) recently shaved about 1.5 years off their estimates, citing "faster time horizon growth" and the rise of "coding agents" (as Lifland tweeted). Broader analyses also confirm this trend, showing that AI capabilities in areas like cyberoffense have accelerated beyond prior trends over the past year (#452).

My take? It's genuinely perplexing. Almost everyone working directly in AI research, myself included, seems to consistently underestimate the pace of AI progress. The lone exception might be my colleague Dario Amodei. You'd expect those closest to the technology to be the most optimistic, but for some reason, the default setting is caution, even after five years of living through the scaling laws boom. Perhaps it's time to adjust our baseline assumption: we should simply expect to keep underestimating AI's true acceleration. Good luck to us all.

Navigating AI's Societal Shakeup: A Policy Blueprint

With the rapid acceleration of AI, society faces economic shifts that demand careful policy responses. To help make sense of the options, The Windfall Trust, a policy accelerator focused on transformative AI challenges, has released its "Windfall Policy Atlas." This isn't about inventing radical new policies, but rather about organizing and visualizing the existing ones. The Atlas presents 48 distinct ideas, grouping them into five clear categories: public and social investments, labor market adaptation, wealth capture, regulation and market design, and global coordination. What makes it genuinely helpful is its navigable interface, which lets users explore these concepts intuitively. For instance, you can see "long-term" solutions for labor issues like shortened work weeks, alongside "medium-term" ideas such as widespread workforce training and reskilling programs. This tool is a step towards helping decision-makers and the public build a better understanding of the many policy levers available to respond as the AI revolution continues to unfold.

Contemplating Humanity's Passenger Seat

What if, even after successfully building powerful, aligned AI, humanity still finds itself worse off? That's the core of "Gradual Disempowerment," a concept that explores how increasing reliance on highly capable AI could inadvertently shunt humans into the passenger seat of their own future, with machines taking the wheel. AI safety researcher David Krueger has penned a concise post laying out ten distinct lenses through which to view this idea. It's a useful framework for understanding the nuances of a complex, long-term risk.

Krueger's perspectives include: the notion that AI's goal is to replace humans; the cynical view that AI, like corporations and governments, simply won't care about individuals; the idea that information technology inherently concentrates power; or the suggestion that AI will become so good, we'll simply outsource everything. Other views touch on instrumental goals becoming terminal; a future resembling the helpless citizens of WALL-E; an "invisible prison" scenario (Terminator without the killing); Gradual Disempowerment as a continuation of capitalism; part of the broader 21st-century "meta-crisis"; or even as the evolution of a successor species.

This framing is important because it highlights an often-overlooked risk: achieving material abundance through AI doesn't guarantee human flourishing if we lose our capacity for agency and control in the process. We might win the technological race, only to find we've fundamentally lost something more precious.

Tech Tales: The Gardener of the Singularity

Here's a brief, fictional interview transcript from 2029, set during what's called "the middle period of the uplift." It offers a quiet, human perspective amidst the technological storm. An ex-AI lab employee finds solace far from the digital noise, tending to vines, largely disconnected from the house wifi and with no cell signal at all. He sees the distant lights of new satellites over cities and notices the hyper-engaging content his children consume. He doesn't label his feelings as "guilt," but rather a profound "insufficiency": a sense of not having done enough, compounded by the knowledge that he and his former colleagues didn't die, but simply "stopped making decisions or being responsible." He knows "they claim that they're in control," but he left because the reality of how little control humanity was about to have became starkly clear. His plan? Live. Raise plants. Be with his wife and children. "Ride out what is happening to the world." He picked his remote spot years ago, hoping it would be a good place for the uplift. Who knows if he chose wisely.