“Dr. Google” had its issues. Can ChatGPT Improve Health?

Some doctors see LLMs as a boon for medical literacy. The average patient may have difficulty navigating the vast landscape of online medical information – and, in particular, distinguishing high-quality sources from sophisticated but factually questionable websites – but LLMs can do that work for them, at least in theory. Treating patients who had searched for their symptoms on Google required “a lot of attacking the patient’s anxiety (and) reducing the misinformation,” says Mark Susi, MD, a practicing radiologist and associate professor at Harvard Medical School. But now, he says, “you see patients with college education, high school education, who can ask questions at the level of an early med student.”

ChatGPT Health and Anthropic’s subsequent announcement of new health integrations for Claude indicate that AI giants are increasingly willing to accept, and even encourage, health-related uses of their models. Such uses certainly come with risks, given LLMs’ well-documented tendencies to agree with users and to fabricate information rather than admit ignorance.

But those risks also have to be weighed against the potential benefits. Consider an analogy from autonomous vehicles: when policymakers consider whether to allow Waymo into their city, the key metric is not whether its cars are ever involved in accidents, but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google – and early evidence suggests it may be – then it could potentially reduce the enormous burden of medical misinformation and unnecessary health anxiety created by the Internet.

However, it is difficult to evaluate how effective chatbots like ChatGPT or Claude are for consumer health. “It is extremely difficult to evaluate open-ended chatbots,” says Danielle Bitterman, clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing exams, but those exams use multiple-choice questions that don’t reflect how people actually use chatbots to look up medical information.

Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, attempted to bridge that gap by evaluating how GPT-4o responded to licensing-exam questions when it did not have access to the list of possible answers. The medical experts who evaluated the responses judged only about half of them to be completely correct. But multiple-choice test questions are designed to be tricky in ways that don’t quite make sense without the answer options, and they’re still a very distant approximation of the kind of thing a user would type into ChatGPT.

A separate study, which tested GPT-4o on more realistic prompts submitted by human volunteers, found that it answered medical questions correctly about 85% of the time. When I spoke to Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, he made it clear that he is personally not a fan of patient-facing medical LLMs. But he freely admits that, technically speaking, they appear to be up to the task – after all, he says, human doctors misdiagnose patients 10% to 15% of the time. “If I look at it objectively, it seems like the world is going to change, whether I like it or not,” he says.

For people looking for medical information online, Yadav says, LLMs seem to be a better option than Google. Radiologist Susi also concluded that LLMs may be a better alternative to web search; his study compared GPT-4’s responses with the information presented in Google’s Knowledge Panel – the information box that sometimes appears on the right side of search results – for questions about common chronic medical conditions.

Since Yadav’s and Susi’s studies appeared online, in the first half of 2025, OpenAI has released several new versions of GPT, and it is reasonable to expect that GPT-5.2 will perform even better than its predecessors. But the studies have significant limitations: they focus on direct, factual questions, and they examine only brief interactions between users and chatbots or web-search tools. Some of the weaknesses of LLMs – particularly their tendencies toward sycophancy and hallucination – may be more likely to rear their heads in longer conversations and with people grappling with more complex problems. Riva Lederman, a professor at the University of Melbourne who studies technology and health, says that patients who do not like the diagnosis or treatment recommendations they receive from a doctor may seek a second opinion from an LLM – and a sycophantic LLM may encourage them to reject their doctor’s advice.

Some studies have found that LLMs will indeed hallucinate and behave sycophantically in response to health prompts. For example, one study showed that GPT-4 and GPT-4o will happily accept and run with false drug information included in a user’s question. In another, GPT-4o often made up definitions for fake syndromes and laboratory tests mentioned in user prompts. Given the abundance of medically questionable diagnoses and treatments on the Internet, these patterns of behavior could contribute to the spread of medical misinformation, especially if people perceive LLMs as trustworthy.

OpenAI has pointed out that the GPT-5 series of models is markedly less sycophantic and less prone to hallucination than its predecessors, so the results of these studies may not apply to ChatGPT Health. The company also evaluated the model that powers ChatGPT Health on health-specific questions using the publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and avoid creating unnecessary stress by telling users that their condition is more serious than it actually is. It’s fair to assume that ChatGPT Health’s underlying model displayed those behaviors in testing, although Bitterman noted that some of the prompts in HealthBench were generated by LLMs rather than by real users, which may limit how well the benchmark translates to the real world.
