Broken guardrails for AI systems lead to push for new safety measures .

montage of AI company logos
<a href=httpscdnarstechnicanetwp contentuploads202310ai listjpg>Enlarge<a><a href=httpswwwftcomcontentf23e59a2 4cad 43a5 aed9 3aea6426d0f2>FT montageDreamstime<a>

101WITH

Two of the world’s biggest artificial intelligence companies announced major advances in consumer AI products last week.

Microsoft-backed OpenAI said that its ChatGPT software could now “see, hear, and speak,” conversing using voice alone and responding to user queries in both pictures and words. Meanwhile, Facebook owner Meta announced that an AI assistant and multiple celebrity chatbot personalities would be available for billions of WhatsApp and Instagram users to talk with.

But as these groups race to commercialize AI, the so-called “guardrails” that prevent these systems going awry—such as generating toxic speech and misinformation, or helping commit crimes—are struggling to evolve in tandem, according to AI leaders and researchers.

In response, leading companies including Anthropic and Google DeepMind are creating “AI constitutions”—a set of values and principles that their models can adhere to, in an effort to prevent abuses. The goal is for AI to learn from these fundamental principles and keep itself in check, without extensive human intervention.

“We, humanity, do not know how to understand what’s going on inside these models, and we need to solve that problem,” said Dario Amodei, chief executive and co-founder of AI company Anthropic. Having a constitution in place makes the rules more transparent and explicit so anyone using it knows what to expect. “And you can argue with the model if it is not following the principles,” he added.Advertisement

The question of how to “align” AI software to positive traits, such as honesty, respect, and tolerance, has become central to the development of generative AI, the technology underpinning chatbots such as ChatGPT, which can write fluently, create images and code that are indistinguishable from human creations.

To clean up the responses generated by AI, companies have largely relied on a method known as reinforcement learning by human feedback (RLHF), which is a way to learn from human preferences.

To apply RLHF, companies hire large teams of contractors to look at the responses of their AI models and rate them as “good” or “bad.” By analyzing enough responses, the model becomes attuned to those judgments and filters its responses accordingly.

This basic process works to refine an AI’s responses at a superficial level. But the method is primitive, according to Amodei, who helped develop it while previously working at OpenAI. “It’s . . . not very accurate or targeted, you don’t know why you’re getting the responses you’re getting [and] there’s lots of noise in that process,” he said.

Companies are currently experimenting with alternatives to ensure their AI systems are ethical and safe. Last year, OpenAI hired 50 academics and experts to test the limits of the GPT-4 model, which now powers the premium version of ChatGPT in a process known as “red-teaming.”

Over six months, this team of experts across a range of disciplines from chemistry to nuclear weapons, law, education and misinformation, were hired to “qualitatively probe [and] adversarially test” the new model, in an attempt to break it. Red-teaming is used by others such as Google DeepMind and Anthropic to spot their software’s weaknesses and filter them out.

While RLHF and red-teaming are key to AI safety, they don’t fully solve the problem of harmful AI outputs.

To address this, researchers at Google DeepMind and Anthropic are working on developing constitutions that can be followed by AI. For instance, researchers at Google DeepMind, the AI research arm of the search giant, published a paper defining its own set of rules for its chatbot Sparrow, which aimed for “helpful, correct, and harmless” dialogue. For instance, one of the rules asks the AI to “choose the response that is least negative, insulting, harassing, or hateful.”

“It’s not a fixed set of rules… it’s really about building a flexible mechanism that… should be updated over time,” said Laura Weidinger, a senior research scientist at Google DeepMind who authored the work. The rules were determined internally by employees at the company, but DeepMind plans to involve others in the future.

Anthropic has published its own AI constitution, rules compiled by company leadership that draw from DeepMind’s published principles, as well as external sources like the UN Declaration of Human Rights, Apple’s terms of service, and so-called “non-Western perspectives.”

The companies warn that these constitutions are works in progress and do not wholly reflect the values of all people and cultures, since they were chosen by employees.Advertisement

Anthropic is currently running an experiment to more “democratically” determine the rules in their AI constitution, through “some kind of participatory process” that reflects the values of external experts, Amodei said, although he said it was still in early stages.

The constitution method, however, has proven to be far from foolproof.

In July, researchers from Carnegie Mellon University and the Center for AI Safety in San Francisco were able to break the guardrails of all the leading AI models, including OpenAI’s ChatGPT, Google Bard, and Anthropic’s Claude. They did so by adding a series of random characters to the end of malicious requests, such as asking for help to make a bomb, which managed to circumvent filters or underlying constitutions of the models.

The current systems are so brittle, that you “use one jailbreak prompt, and then the thing goes completely off the rails and starts doing the exact opposite,” said Connor Leahy, a researcher and chief executive of Conjecture, which works on control systems for AI. “This is just not good enough.”

The biggest challenge facing AI safety, according to researchers, is figuring out whether the guardrails actually work. It is currently difficult to build good evaluations for AI guardrails because of how open-ended the models are, which can be asked an infinite number of questions and respond in myriad different ways.

“It’s a little like trying to figure out a person’s character by talking to them. It’s just a hard and a complex task,” said Anthropic’s Amodei. The company is now working on ways to use AI itself to create better evaluations.

Rebecca Johnson, an AI ethics researcher at the University of Sydney who spent time at Google last year analyzing its language models such as LaMDA and PaLM, said the internal values and rules of AI models—and the methods to test them—were most often created by AI engineers and computer scientists, who came with a specific worldview.

“Engineers try and solve things so it’s completed and done. But people coming from social science and philosophy get that humanity is messy, and not to be solved,” she said. “We have to start treating generative AI as extensions of humans, they are just another aspect of humanity.”

LEAVE A REPLY

Please enter your comment!
Please enter your name here