Before launching, GPT-4o broke records on chatbot leaderboard under a secret name

May 14, 2024

Anonymous chatbot that mystified and frustrated experts was OpenAI’s latest model.

On Monday, OpenAI employee William Fedus confirmed on X that a mysterious chart-topping AI chatbot known as “gpt2-chatbot,” which had been undergoing testing on LMSYS’s Chatbot Arena and frustrating experts, was in fact OpenAI’s newly announced GPT-4o AI model. He also revealed that GPT-4o had topped the Chatbot Arena leaderboard, achieving the highest documented score ever.

“GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot,” Fedus tweeted.

Chatbot Arena is a website where visitors converse with two random AI language models side by side without knowing which model is which, then choose which model gives the best response. It’s a perfect example of vibe-based AI benchmarking, as AI researcher Simon Willison calls it.

An LMSYS Elo chart shared by William Fedus, showing OpenAI’s GPT-4o under the name “im-also-a-good-gpt2-chatbot” topping the charts. (Credit: William Fedus)

The gpt2-chatbot models appeared in April, and we wrote about how the lack of transparency over the AI testing process on LMSYS left AI experts like Willison frustrated. “The whole situation is so infuriatingly representative of LLM research,” he told Ars at the time. “A completely unannounced, opaque release and now the entire Internet is running non-scientific ‘vibe checks’ in parallel.”

“gpt2-chatbots have just surged to the top, surpassing all the models by a significant gap (~50 Elo). It has become the strongest model ever in the Arena,” wrote the lmsys.org X account while sharing a chart. “This is an internal screenshot,” it wrote. “Its public version ‘gpt-4o’ is now in Arena and will soon appear on the public leaderboard!”

An internal screenshot of the LMSYS Chatbot Arena leaderboard showing “im-also-a-good-gpt2-chatbot” leading the pack. We now know that it’s GPT-4o. (Credit: LMSYS)

As of this writing, im-also-a-good-gpt2-chatbot holds a 1309 Elo versus GPT-4-Turbo-2024-04-09’s 1253 and Claude 3 Opus’s 1246. Claude 3 and GPT-4 Turbo had been duking it out on the charts for some time before the three gpt2-chatbots appeared and shook things up.
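
To put those rating gaps in perspective, here is a rough sketch of what they would imply in head-to-head matchups under the classic Elo formula. It rests on assumptions: the function name `elo_expected_score` is ours, the 400-point scale is the standard chess convention, ties are ignored, and LMSYS's actual leaderboard computation may differ in its details.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B under classic Elo.

    Assumes the conventional 400-point logistic scale; this is an
    illustration, not LMSYS's own scoring code.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the article.
gpt4o = 1309         # im-also-a-good-gpt2-chatbot
gpt4_turbo = 1253
claude3_opus = 1246

p_vs_turbo = elo_expected_score(gpt4o, gpt4_turbo)
p_vs_opus = elo_expected_score(gpt4o, claude3_opus)

print(f"vs. GPT-4 Turbo:   {p_vs_turbo:.2f}")  # 0.58
print(f"vs. Claude 3 Opus: {p_vs_opus:.2f}")   # 0.59
```

By this estimate, a lead of roughly 50 to 60 Elo points would mean voters prefer the top model's answer a bit under six times out of ten: a clear edge, but far from a sweep.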

I’m a good chatbot

For the record, the “I’m a good chatbot” in the gpt2-chatbot test name is a reference to an episode that occurred while a Reddit user named Curious_Evolver was testing an early, “unhinged” version of Bing Chat in February 2023. After an argument about what time Avatar 2 would be showing, the conversation eroded quickly.
