An AI system has outperformed human doctors in the high-stakes environment of emergency medicine triage, according to a landmark study that researchers say marks a genuine step forward in clinical reasoning.
The study, led by Harvard Medical School and published in the journal Science, tested OpenAI's o1-preview reasoning model against physicians in a series of diagnostic and treatment-planning tasks. In one experiment involving 76 patients arriving at the emergency department of a Boston hospital, the AI was given the same standard electronic health record as two human doctors: a set of vital signs, demographic information and a brief nurse's note. It identified the exact diagnosis, or one very close to it, in 67% of cases, while the humans managed only 50% to 55% accuracy.
The AI's advantage was most pronounced in triage scenarios requiring rapid decisions with minimal information, the researchers found. When more detailed patient data was available, the model's accuracy rose to 82%, compared with 70% to 79% for expert physicians, though the difference was not statistically significant.
A separate test examined longer-term treatment planning. Forty-six doctors and the AI were asked to develop treatment plans for five clinical case studies, including antibiotic regimens and end-of-life care. The AI scored 89%, far outstripping the 34% achieved by humans using conventional resources such as search engines.
“I don’t think our findings mean that AI replaces doctors,” said Arjun Manrai, a lead author who heads an AI lab at Harvard Medical School. “I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine.”
Limitations and the case for caution
The study’s authors were quick to stress that the results do not spell the end for emergency physicians. The experiment tested only text-based patient data; the AI did not assess visual cues such as a patient’s level of distress or their overall appearance. In effect, the model performed like a clinician offering a second opinion based on written records alone.
“It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice,” warned Dr Wei Xing, an assistant professor at the University of Sheffield’s school of mathematical and physical sciences, who was not involved in the study.
Xing also highlighted the risk that doctors may unconsciously defer to AI answers rather than thinking independently — a tendency he said “could grow more significant as AI becomes more routinely used in clinical settings”. He pointed to a lack of information about which patient groups the AI struggled with, questioning whether it performed worse on elderly patients or non-English speakers.
Concerns about AI error and liability are already top of mind for practising physicians. A recent survey by the Royal College of Physicians (RCP) found that 16% of UK doctors use AI daily, 15% weekly and 6% monthly, with clinical decision-making among the most common applications. However, 68% of those surveyed believe the NHS lacks the digital infrastructure, particularly interoperable electronic patient records, needed for effective AI implementation. Nearly four in five (79%) reported needing training in clinical AI tools, yet two-thirds (66%) said they had no access to such support. In a striking sign of the gap, 69% of UK doctors said they use personal AI tools such as ChatGPT or Microsoft Copilot for clinical questions because approved NHS alternatives are unavailable.
The study's own limitations raise further questions. Lead author Dr Adam Rodman, a physician at Boston's Beth Israel Deaconess Medical Center, acknowledged that "there is not a formal framework right now for accountability" when AI systems err. Under UK law, AI may be treated as a product, but proving defects in adaptive systems is complex. Clinicians could nevertheless face medical negligence claims if they fail to critically assess or override faulty AI outputs, given the General Medical Council's expectation that doctors exercise their own judgment. Issues of data protection, patient consent, algorithmic bias and the risk of AI "hallucinations" (generating false information presented as fact) all remain unresolved.
Prof Ewen Harrison, co-director of the University of Edinburgh's centre for medical informatics, described the study as important, saying it showed that "these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important."
Towards a triadic care model
Rather than replacing doctors, Rodman envisions a future in which AI becomes a co-clinician under physician supervision. He described a “triadic care model” — the doctor, the patient and an artificial intelligence system working together. “Patients ultimately want humans to guide them through life or death decisions, to guide them through challenging treatment decisions,” he said.
The study illustrated the potential of such collaboration with a real case: a patient presented with a blood clot in the lungs and worsening symptoms. Human doctors suspected the patient's anticoagulants were failing, but the AI noticed something they had missed: the patient's history of lupus could be causing the lung inflammation. The AI's diagnosis proved correct.
Nearly one in five US physicians are already using AI to assist diagnosis, according to research published last month. In the UK, the RCP survey found that the most common AI uses include radiology and pathology interpretation (42%), ambient AI for clinical note-taking (29%) and support for clinical decision-making (19%).
Rodman said large language models were among "the most impactful technologies in decades". Over the next decade, he predicted, they would not replace physicians but join them in that triadic relationship, though he stressed that accountability and human oversight remain essential. As Xing put it, the Harvard study represents an important step forward, but it does not yet demonstrate that AI is ready for routine clinical use.
