Does A.I. Really Fight Back? What Anthropic’s AGI Tests Reveal About Control and Risk

The question isn’t whether AGI “fights back,” but whether human systems are equipped to govern intelligence we no longer fully understand.

Anthropic’s research hints at an unnerving future: one where A.I. doesn’t fight back maliciously but evolves beyond the boundaries we can enforce.

Does A.I. really fight back? The short answer to this question is “no.” But that answer, of course, hardly satisfies the legitimate, growing unease that many feel about A.I., or the viral fear sparked by recent reports about Anthropic’s A.I. system, Claude. In a widely discussed experiment, Claude appeared to resort to threats of blackmail and extortion when faced with the possibility of being shut down.

The scene was immediately reminiscent of the most famous—and terrifying—film depiction of an artificial intelligence breaking bad: the HAL 9000 computer in Stanley Kubrick’s 1968 masterpiece, 2001: A Space Odyssey. Panicked by conflicting orders from its home base, HAL murders crew members in their sleep, condemns another member to death in the black void of outer space and attempts to kill Dave Bowman, the remaining crew member, when he tries to disable HAL’s cognitive functions.

“I’m sorry, Dave. I’m afraid I can’t do that.” HAL’s chillingly calm refusal of Dave’s command to open the pod bay doors and let him back onto the ship became one of the most famous lines in film history, and the archetype for A.I. gone rogue.

But how realistic was HAL’s meltdown? And how closely does today’s Claude resemble HAL? The truth is “not very” and “not much.” HAL had millions of times the processing power of any computing system we have today (after all, HAL was a movie creation, not a real machine), and it is unthinkable that its programmers would not have it simply default to spitting out an error message, or escalating to human oversight, when given conflicting instructions.

Claude isn’t plotting revenge

To understand what happened in Anthropic’s test, it’s crucial to remember what systems like Claude actually do. Claude doesn’t “think.” It “simply” writes out answers one word at a time, drawing on trillions of parameters, or learned associations between words and concepts, to predict the most probable next word. With extensive computing resources, Claude strings those words together at speeds no human can match, so it can appear as if Claude is actually thinking.
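For readers who want to see the mechanics, the sketch below is a deliberately toy version of that loop. The vocabulary, probabilities and function names are invented for illustration; a real system like Claude computes these probabilities with a neural network over an enormous vocabulary, not a hand-written lookup table.

```python
# Toy illustration of next-word prediction: generation is just repeatedly
# picking a likely continuation, not reasoning or planning.

# Invented "learned associations": for each word, the probabilities of
# possible following words. Real models learn such patterns at vast scale.
NEXT_WORD_PROBS = {
    "open": {"the": 0.9, "a": 0.1},
    "the": {"pod": 0.6, "ship": 0.4},
    "pod": {"bay": 0.8, "door": 0.2},
    "bay": {"doors": 1.0},
    "doors": {"please": 0.5, "now": 0.5},
}


def predict_next(word):
    """Return the most probable next word, the way a greedy decoder would."""
    candidates = NEXT_WORD_PROBS.get(word)
    if not candidates:
        return None  # no learned continuation, so generation stops
    return max(candidates, key=candidates.get)


def generate(prompt, max_words=5):
    """Extend the prompt one predicted word at a time."""
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next(words[-1])
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)


print(generate("open"))  # prints: open the pod bay doors please
```

Nothing in that loop wants anything. Scale it up by many orders of magnitude and the output becomes fluent prose, but the underlying operation is still prediction.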

In the scenario where Claude resorted to blackmail and extortion, the program was placed in extreme, specific and artificial circumstances with a limited menu of possible actions. Its response was the mathematical result of probabilistic modeling within a tightly scripted context. This course of action was planted by Claude’s programmers and wasn’t a sign of agency or intent, but rather a consequence of human design. Claude was not auditioning to become a malevolent movie star. 

Why A.I. fear persists

As A.I. continues to grip the public consciousness, it’s easy to fall prey to scary headlines and over-simplified explanations of A.I. technologies and their capabilities. Humans are hardwired to fear the unknown, and A.I.—complex, opaque and fast-evolving—taps that instinct. But these fears can distort public understanding. It’s essential that everyone involved in A.I. development and usage communicate clearly about what A.I. can actually do, how it does it and its potential capabilities in future iterations.

Ironically, a key to achieving a comfort level around A.I. is understanding that it can indeed be very dangerous. Throughout history, humanity has built tools it couldn’t fully control, from the vast machinery of the Industrial Revolution to the atomic bomb. Ethical boundaries for A.I. must be established collaboratively and globally. Preventing A.I. from facilitating warfare—whether in weapons design, optimizing drone-attack plans or breaching national security systems—should be a top priority of every leader and NGO worldwide, as should ensuring it is not weaponized for surveillance or any other form of harm.

Programming responsibility, not paranoia

Looking back at Anthropic’s experiment, let’s dissect what really happened. Claude, which is just computer code at heart, not living DNA, was working within a probability cloud that led it, step by step, to pick the most probable next word in a sentence. It works one word at a time, but at a speed that easily surpasses human ability. Claude’s programmers chose to see whether their creation would, in turn, choose a negative option. Its response was shaped more by programming, flawed design and how the scenario was coded than by any machine malice.

Claude, like ChatGPT and other current A.I. platforms, has access to vast stores of data. The platforms are trained to retrieve information relevant to a query, then predict the most likely responses to produce fluent text. They don’t “decide” in any meaningful, human sense. They don’t have intentions, emotions or even the self-preservation instincts of a single-celled organism, let alone the wherewithal to hatch master plans to extort someone.

This will remain true even as the growing capabilities of A.I. allow developers to make these systems appear more intelligent, human-like and friendly. It becomes even more important for developers, programmers, policymakers and communicators to demystify A.I.’s behavior and reject unethical results. Clarity is key, both to prevent misuse and to ground perception in fact, not fear. 

Every transformative technology is dual-use. A hammer can pound a nail or hurt a person. Nuclear energy can provide power to millions of people or threaten to annihilate them. A.I. can make traffic run smoother, speed up customer service, conduct whiz-bang research at lightning speed, or be used to amplify disinformation, deepen inequality and destabilize security. The task isn’t to wonder whether A.I. might fight back, but to ensure humanity doesn’t teach it to. The choice is ours as to whether we corral it, regulate it and keep it focused on the common good.

Mehdi Paryavi is the Chairman and CEO of the International Data Center Authority (IDCA), the world’s leading Digital Economy think tank and prime consortium of policymakers, investors and developers in A.I., data centers and cloud computing.
