Last September roboticist Benji Holson posted the “Humanoid Olympic Games”: a set of increasingly difficult tests for humanoid robots, which he demonstrated himself while wearing a silver bodysuit. Challenges such as opening a door with a round doorknob started out easy, at least for a human, and progressed to “gold medal” tasks such as properly buttoning and hanging a men’s dress shirt and using a key to open a door.
Holson argued that the hard part isn’t the glitz: while other competitions feature robots playing games and dancing, the robots we really want are ones that can wash clothes and cook food.
He expected the challenges would take years to solve. Instead, within months, the robotics company Physical Intelligence completed 11 of the 15 challenges, from bronze to gold, with a robot that washed windows, spread peanut butter and bagged dog poop.
Scientific American spoke with Holson about why vision-only, camera-based systems are outperforming his expectations and how close we are to a truly useful machine. He has since released a new, more difficult set of challenges.
(An edited transcript of the interview follows.)
You have designed these challenges to be difficult. Were you surprised that the results came so quickly?
It was much faster than I expected. When I chose the challenges, I tried to calibrate them so that some of the bronze challenges would be completed in the first month or two, the silver and gold ones over the next six months, and the hardest challenges might take a year and a half. It’s ridiculous that they did almost all of them in the first three months.
What made this possible?
I started from the premise that we have things that look impressive on a fairly narrow set of tasks: vision only, no touch, simple manipulators, incredible accuracy. That limits what you can be good at. I tried to think of tasks that would force us to move beyond that set. It turns out that I’d vastly underestimated what’s possible with vision only and simple manipulators.
When I visited Physical Intelligence, I learned that they have no force sensing. They’re doing all of this 100 percent vision-based. Key insertion, spreading peanut butter: I thought these would require force input. But apparently you train on more demonstration video, and it works.
How do you actually train a robot to do this without coding it line by line?
All of this is learned from demonstrations. Someone teleoperates a robot doing a task hundreds of times, a model is trained on that data, and then the robot can perform the task.
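The training recipe Holson describes is essentially behavioral cloning: treat the teleoperator’s (observation, action) pairs as supervised data and fit a policy to imitate them. As a minimal sketch under that assumption, here a linear least-squares fit stands in for the deep network a real system would train on camera images; the data and policy shapes are hypothetical.

```python
import numpy as np

# Toy behavioral cloning: learn a policy mapping observations to actions
# from teleoperated demonstrations. Real systems train deep networks on
# camera images; a linear least-squares fit stands in for that here.

rng = np.random.default_rng(0)

# Hypothetical "demonstrations": 500 (observation, action) pairs in which
# the teleoperator's action is a fixed linear function of the observation.
true_policy = np.array([[0.5, -1.0], [2.0, 0.3]])  # unknown to the learner
observations = rng.normal(size=(500, 2))
actions = observations @ true_policy.T

# "Training": fit a policy that imitates the demonstrator's actions.
learned_policy, *_ = np.linalg.lstsq(observations, actions, rcond=None)

# Deployment: the robot maps a new observation to an action on its own.
new_obs = np.array([1.0, 2.0])
predicted_action = new_obs @ learned_policy
```

The point of the sketch is the workflow, not the model class: demonstrations in, supervised fit, then autonomous rollout with no task-specific code.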
There is a lot of confusion about whether large language models (LLMs) are useful for robots. Are they?
I was quite doubtful about the utility of LLMs in robotics. The problem they were good at solving two or three years ago was high-level planning: “If I want to make tea, what are the steps?” But sequencing the steps is the easy part. Picking up a teapot and filling it is the genuinely hard part.
On the other hand, we have started building vision-action models using the same transformer architecture that underlies LLMs. You can use transformers for text in, text out and image in, text out, but also for image in, robot action out.
The nice thing is that they start from models pretrained on text, images, maybe videos. Before you start training on your specific task, the AI already understands what a teapot is, what water is, and that maybe you want to fill the teapot with water. So when you train on your task, it doesn’t need to start with “let me figure out what geometry is.” It can start with “I see, we’re passing the teapot around,” which, strangely enough, works.
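The reuse Holson describes can be sketched as: tokenize the input (text or image patches), run it through the same transformer core, and swap only the output head (next-token logits versus continuous joint commands). This is a minimal, hypothetical illustration with one untrained attention layer standing in for a full transformer; all sizes and weights are made up.

```python
import numpy as np

# Sketch: one shared attention "core" serves both text and robot control;
# only the tokenizer and output head change. Weights are random (untrained).

rng = np.random.default_rng(0)
d = 16            # embedding width shared across modalities
n_patches = 9     # a 3x3 grid of image-patch tokens
n_joints = 7      # action head: one command per robot joint

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_core(tokens, wq, wk, wv):
    # Self-attention over a token sequence; agnostic to modality.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v

wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
action_head = rng.normal(size=(d, n_joints)) * 0.1  # swapped-in output head

patch_embeddings = rng.normal(size=(n_patches, d))  # "image in"
features = attention_core(patch_embeddings, wq, wk, wv)
actions = features.mean(axis=0) @ action_head       # "robot action out"
```

Because the core never looks at what the tokens mean, a model pretrained on text and images can be fine-tuned for action output without relearning basic structure from scratch.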
How did you think about the “Olympic” tasks?
So part of it was a challenge and part of it was a prediction. I tried to think of the next set of things that we can’t do now and that someone will be able to do soon.
Humans rely on touch for tasks such as finding keys in their pockets. How can robots get by without it?
That’s a very good question that we don’t know the answer to yet. Touch sensing is much worse: more expensive, more fragile and far behind cameras. Cameras we’ve been working on for a long time.
The big question is: Are cameras enough? Both Physical Intelligence and Sunday Robotics (which completed the bronze-medal task of rolling matched socks) are betting that having a camera on the wrist, very close to the fingers, lets you see how everything is deforming. When the robot grabs something, it sees the rubber on the fingers deflect and the object deflect, and it infers the forces from that. When spreading peanut butter on bread, the robot watches the knife move down and compress the bread and gauges the force from that. It works much better than I expected.
What about safety?
The energy required to stay balanced is often quite high. If a robot is falling, it takes a very fast, hard acceleration to get the legs forward in time. Your system has to be able to inject a lot of energy into the world, and that’s unsafe.
I’m a big fan of the centaur form factor: a mobile wheeled base with arms and a head. For safety’s sake, that’s an easier way to get there quickly. If a humanoid loses power, it falls over. The general plan seems to be to make a robot so incredibly valuable that we as a society create a new safety class for it, like bicycles and cars: they are dangerous but so valuable that we tolerate the risk.
Have these results changed your timeline?
I thought home robots were at least 15 years away. Now I think at least six. The difference is that I thought it would take a long time before humanoids could do anything useful in that space, even as a demo; that part now looks plausible.
But roboticists have seen again and again that there’s a long road between “this worked in a lab and I got a video” and “I can sell a product.” Waymo was driving on roads in 2009; I couldn’t buy a ride until 2024. It takes a long time to get to reliability.
What is the biggest hurdle remaining?
Reliability and safety. What Physical Intelligence has shown is incredibly impressive, but if you put the robot at a different table with different lighting and a different sock, it may not work. Each step toward generalization seems to take orders of magnitude more data, turning days of data collection into weeks or months.
