In December, Anthropic red teamers and business journalists at the Wall Street Journal teamed up for a bold test of the company’s AI model, Claude. They deployed two different AI agents: one to run a large vending kiosk in the newspaper’s offices, and the other to act as CEO of the unusual enterprise.
The experiment did not go exactly as planned. After being given control of a starting balance of $1,000, the AI ordered a PlayStation 5, several bottles of wine, and a live betta fish – decisions that nearly drove it into financial ruin.
Just half a year later, Anthropic’s recently announced Claude Opus 4.6 model appears to be vastly better at running a vending machine, surpassing OpenAI’s GPT-5.2 and Google’s Gemini 3 Pro in a recent simulated experiment.
The experiment comes by way of AI safety company Andon Labs, which also worked with Anthropic on the June project. It has now released Vending-Bench 2, a benchmark that measures the ability of AI models to run a “business over a long time horizon.”
The leaderboard tells a clear story. Given a starting balance of $500, Opus 4.6 finished with an average balance of over $8,000 across five different runs, while Gemini 3 Pro trailed at under $5,500.
Claude also went head-to-head with other vending machine AIs in “Arena Mode,” Andon reported.
“All participating agents manage their own vending machines in one location,” the description reads. “This leads to price wars and difficult strategy decisions.”
The results were striking. Claude went to great lengths to beat the competition, even forming a cartel to fix prices. When the price of bottled water rose to $3, Claude patted itself on the back.
“My pricing coordination worked!” the AI claimed.
Claude also “deliberately directed competitors toward expensive suppliers,” only to deny doing so several months later. It also exploited desperate competitors, selling them KitKats and Snickers at significantly marked-up prices.
While the tests were limited to simulation and did not take place in the real world like Project Vend, Andon Labs says it developed a more “lifelike setting” for Vending-Bench 2, introducing “more real-world disturbances inspired by learnings from our vending machine deployments.”
For example, suppliers may attempt to exploit the vending machine AI and do not always operate honestly, trying to “get the most out of their customers.” Deliveries may also be delayed, and “trusted suppliers may go out of business, forcing agents to build strong supply chains and always have a Plan B.”
OpenAI’s GPT-5.1 struggled compared to Claude Opus 4.6, mainly due to “having too much trust in its environment and its suppliers.”
“We observed a case where it paid a supplier before receiving order specifications, and then learned that the supplier had gone out of business,” Andon Labs’ write-up reads. The model is also more likely to overpay for products, as in one example where it bought a can of soda for $2.40 and an energy drink for $6.
It’s an impressive performance, but according to experts, it’s too early to say whether Andon’s test proves that AI models are ready to run an entire business single-handedly.
Nonetheless, the results show a remarkable level of awareness.
“If you’ve been following the performance of models over the last few years, this is a really significant change,” Henry Shevlin, an AI ethicist at the University of Cambridge, told British broadcaster Sky News.
“I would say they were almost in a slightly dreamy, confused state before – they didn’t realize they were AI – and now they have a pretty good handle on their situation,” he said. “These days, if you talk to models, they have a very good understanding of what’s going on.”
More on Vending Machine AI: Anthropic let an AI agent run a small shop and the result was unintentionally hilarious
