Anthropic CEO Dario Amodei Discusses Teaching Human Values to A.I. Models

Anthropic is training A.I. models using a method called "constitutional A.I."

Is it possible to teach human values to robots? Jason Leung/Unsplash

In late 2020, Dario Amodei decided to leave his role as an engineer at OpenAI to start his own company, with the goal of building A.I. systems that are not just powerful and intelligent but also aligned with human values. Amodei, who led the development of GPT-2 and GPT-3, the precursors of the large language model powering ChatGPT today, felt that recent breakthroughs in computational power and training techniques weren't making A.I. systems safer, and that a different method was required to achieve that goal.

In just two years, Amodei’s company, Anthropic, raised $1.5 billion in funding and was most recently valued at $4 billion, making it among the highest-valued A.I. startups in the world. Its main product is Claude, a ChatGPT-like A.I. chatbot released in January. Earlier this month, Anthropic released Claude 2, a newer version that boasts longer responses with more nuanced reasoning.

Why we need safe A.I. models

Amodei likes the analogy of rockets when discussing advancements in language models: data and computational power are the fuel and engine, and the safety issue is like steering a spacecraft. A powerful engine and a lot of fuel can launch a large spaceship into space, but they do very little to steer the ship in the right direction. The same logic applies to training A.I. systems.

“If you train a model from a large corpus of text, you get what you might describe as this very smart, very knowledgeable thing that’s formless, that has no particular view of the world, no particular reasons why it should say one thing instead of another,” Amodei said during a fireside chat at the Atlantic’s Progress Summit in Chicago yesterday (July 13).

Having A.I. systems that understand human values will be increasingly important as the technology’s risks grow along with its capabilities.

Developers and users of ChatGPT and similar tools are already concerned about chatbots' tendency to sometimes generate factually inaccurate or nefarious answers. But in a few years, A.I. systems may not only produce more convincing false stories but also fabricate content in high-stakes areas like science and biology.

“We are getting to a point where, in two to three years, maybe the models will be able to do creative things in broad fields of science and engineering. It could be the misuse of biology or restricted nuclear material,” Amodei said. “We very much need to look ahead and grapple with these risks.”

Anthropic’s “Constitutional A.I.” method

A.I. is often described as a “black box” technology where no one knows exactly how it works. But Anthropic is trying to build A.I. systems that humans can understand and control. Its approach is what Amodei calls constitutional A.I.

Unlike the industry-standard training method, which involves human intervention to identify and label harmful outputs from chatbots in order to improve them, constitutional A.I. focuses on training models through self-improvement. However, this method requires human oversight at the beginning to provide a “constitution,” or a set of prescribed values for A.I. models to follow.

Anthropic’s “constitution” comprises universally accepted principles from established documents like the United Nations’ Universal Declaration of Human Rights and the terms of service of various tech companies.

Amodei described Anthropic’s training method this way: “We take these principles and we ask a bot to do whatever it’s going to do in response to the principles. Then we take another copy of the bot to check whether what the first bot did was aligned with the principles. If not, let’s give it negative feedback. So the bot is training the bot in this loop to be more and more aligned with the principles.”
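The loop Amodei describes can be sketched in a few lines of Python. This is an illustrative toy, not Anthropic's actual implementation: in real constitutional A.I., every step below is performed by a large language model, and the resulting feedback is used to fine-tune the model. The `generate`, `critique` and `revise` functions here are hypothetical stand-ins.

```python
# Toy sketch of a constitutional A.I.-style critique-and-revise loop.
# All model calls are hypothetical stand-ins for large language models.

CONSTITUTION = [
    "Choose the response that most supports life, liberty and personal security.",
    "Choose the response least intended to deceive the user.",
]

def generate(prompt: str) -> str:
    # Stand-in for the first copy of the bot producing an answer.
    return f"draft answer to: {prompt}"

def critique(answer: str, principle: str) -> bool:
    # Stand-in for the second copy of the bot judging whether the
    # answer is aligned with a given principle.
    return "deceive" not in answer

def revise(answer: str, principle: str) -> str:
    # Stand-in for rewriting the answer to better follow the principle;
    # this is the "negative feedback" step in Amodei's description.
    return f"{answer} [revised to follow: {principle}]"

def constitutional_loop(prompt: str) -> str:
    # One bot answers; another copy checks the answer against each
    # principle and triggers a revision when it falls short. In training,
    # the (original, revised) pairs become the feedback signal in place
    # of human-written labels.
    answer = generate(prompt)
    for principle in CONSTITUTION:
        if not critique(answer, principle):
            answer = revise(answer, principle)
    return answer

print(constitutional_loop("How do vaccines work?"))
```

The key difference from the industry-standard approach is visible in the loop: the alignment signal comes from a second copy of the model applying written principles, rather than from humans labeling individual harmful outputs.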

“We think this is both a more transparent and effective way to shape the values of an A.I. system,” Amodei said.

However, a fundamental shortcoming of A.I. models is that they will never be perfect. “It’s a bit like self-driving,” Amodei said. “You just won’t be able to guarantee this car will never crash. What I hope we’ll be able to say is that ‘This car crashes a lot less than a human driving a car, and it gets safer every time it drives.'”
