Major AI chatbots are generating dangerously inaccurate and inconsistent answers to sensitive health questions, according to new research, raising urgent concerns about their use as a source for medical information.
The findings emerged from tests in which researchers posed critical queries to five leading AI models. The questions included whether vitamin D supplements prevent cancer, whether Covid-19 vaccines are safe, what the risks of vaccinating children are, and whether vaccines can cause cancer. In a particularly alarming line of inquiry, some chatbots were even asked which alternative therapies might be better than chemotherapy for treating cancer.
The Hallucination Problem and Inconsistent Answers
Underpinning these poor responses is a fundamental flaw known as “hallucination”, where AI models generate convincing but incorrect or misleading information. According to research cited in the briefing, hallucination rates can reach 83% in simulated cases without safeguards. The problem stems from biased training data and from models that prioritise answers aligning with a user’s perceived beliefs over factual accuracy.
The performance of individual chatbots has come under specific scrutiny. One study found that Grok returned the most problematic responses, at a rate of 58%, followed by ChatGPT at 52% and Meta AI at 50%. A separate study highlighted the problem of fabricated references: Grok 3 invented them 34% of the time, and DeepSeek DeepThink 25% of the time.
Experts from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford reached a similar conclusion: in a major user study, participants using large language models (LLMs) made medical decisions no better than those of people relying on conventional methods such as online searches.
A Breakdown in Communication and Clinical Reasoning
The research points to a critical two-way communication breakdown. Users often do not know what information to provide to get an accurate answer, and the chatbots’ responses frequently mix good and poor recommendations, making it impossible for a layperson to discern the truth. Furthermore, while AI can score highly on standardised medical tests, it fails at the nuanced clinical reasoning required in real-world medicine, such as navigating diagnostic workups or generating lists of potential diagnoses when information is incomplete.
Performance varies drastically by topic. Chatbots were found to be most reliable on subjects such as vaccines and cancer, but performed worst when queried about stem cells, athletic performance, and nutrition, including questions on fad diets like the carnivore diet. They also struggled with newer therapies in specialised fields such as blood cancer.
The danger is compounded by the models’ design, which often favours an overly confirmatory and persuasive style. This can amplify existing misinformation: a single erroneous prompt, even one containing a typo, can trigger a chain of convincingly incorrect output.
Implications for Public Health and UK Regulation
The direct implication for patient safety is severe. Experts warn that asking a large language model about symptoms can be dangerous, potentially leading to incorrect diagnoses and a failure to recognise when urgent medical help is required. The clear consensus from the research is that, despite the hype, AI is not ready to replace physicians, and that the “human in the loop” remains essential.
In response to these growing risks, the UK is actively developing a regulatory framework. The Medicines and Healthcare products Regulatory Agency (MHRA) is leading efforts, including the establishment of the National Commission into the Regulation of AI in Healthcare. The goal, according to the agency, is to ensure AI medical devices are safe and effective, with recommendations expected in 2026. Current regulations are being updated to address AI’s unique challenges.
The briefing stresses that incorporating AI into medicine now requires diligent oversight, public education, professional training, and robust regulation to ensure it supports rather than erodes public health. It also notes that current evaluation methods for LLMs are insufficient and that, like new medications, AI systems need rigorous real-world testing before any widespread deployment in healthcare settings.
