Bean says that ideally, health chatbots would be subjected to controlled trials with human users before being released to the public, as was done in their study. This can be a heavy lift, especially considering how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out about a year ago and is now outdated.
Earlier this month, Google released a study that cites Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, the AMIE diagnoses were as accurate as those of the physicians, and none of the interactions raised major safety concerns for the researchers.
Despite the encouraging results, Google is not planning to release AMIE any time soon. “Although research has progressed, there are some important limitations that must be addressed before systems for diagnosis and treatment can be translated into the real world, including further research into equity, fairness, and safety testing,” Alan Karthikesalingam, a research scientist at Google DeepMind, wrote in an email. Google recently revealed that Health100, a health platform it’s building in partnership with CVS, will include an AI assistant powered by its flagship Gemini model, though that device likely won’t be for diagnosis or treatment.
Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such comprehensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There are several reasons why the clinical trial paradigm does not always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks from a credible third party that we can agree are meaningful, that labs can hold themselves to?”
The key words there are “third party.” No matter how thoroughly companies evaluate their own products, it is difficult to fully trust their findings. Third-party evaluation not only brings objectivity but also helps avoid blind spots, especially if multiple independent evaluators are involved.
OpenAI’s Singhal says he is strongly in favor of external evaluation. “We do our best to support the community,” he says. “Part of why we put together HealthBench was really to give the community and other model developers an example of what a really good evaluation looks like.”
Given how expensive it is to produce high-quality evaluations, he says, he doubts that any individual academic laboratory will be able to produce what he calls “one evaluation to rule them all.” But he applauds efforts by academic groups to bring together pre-existing and novel evaluations into comprehensive evaluation suites — such as Stanford’s MedHELM framework, which tests models on a variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.
Nigam Shah, a professor of medicine at Stanford University who leads the MedHELM project, says it has limitations. Typically, it evaluates only individual chatbot responses, but someone seeking medical advice from a chatbot can engage it in a multi-turn, back-and-forth conversation. He says he and some colleagues are preparing to create an assessment that can score those more complex conversations, but it will take time and money. “You and I have no ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they want,” he says. “The only thing people like us can do is find a way to fund the benchmarks.”
No one interviewed for this article argued that health LLMs need to perform flawlessly on third-party assessments in order to be released. Doctors themselves make mistakes, and for someone with only occasional access to a doctor, a constantly accessible LLM that occasionally messes up may still be a vast improvement on the status quo, as long as its errors are not too serious.
However, given the current state of the evidence, it is impossible to know for sure whether currently available tools actually produce such improvements, or whether their risks outweigh their benefits.