Diagnostic interviews, long regarded as the gold standard for assessing mental health and substance use disorders in both clinical practice and research, vary considerably in reliability depending on the condition being assessed, a new meta-analysis published in JAMA Network Open has found.
The study, led by Laura Duncan, an assistant professor in Psychiatry and Behavioural Neurosciences at McMaster University in Ontario, Canada, pooled data from 57 studies across 26 countries involving more than 8,000 adults. Researchers examined test-retest reliability — the likelihood that a patient receives the same diagnosis when given the same interview twice — using Cohen’s kappa coefficient, a statistical measure that accounts for agreement occurring by chance. The overall pooled kappa was 0.69, indicating only moderate reliability.
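For readers unfamiliar with the statistic, kappa compares the observed rate of agreement between two interviews (p_o) with the agreement expected by chance alone (p_e): kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration, not the study's own code. The function name and the ten-patient data set are hypothetical and are not taken from the meta-analysis.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of patients given the same label both times.
    p_o = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the two marginal proportions, summed over labels.
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 10 patients interviewed twice (1 = disorder diagnosed).
test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
retest = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(test, retest)
print(round(kappa, 3))  # 0.583
```

In this made-up example the two interviews agree for 8 of 10 patients (80%), yet kappa is only about 0.58, because roughly half of that agreement would be expected by chance given how often each diagnosis was made.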
The findings showed that reliability was not uniform. Substance use disorders (SUDs) achieved a kappa of 0.72, significantly higher than the 0.65 recorded for other mental disorders. Opioid use disorder was the most consistently diagnosed condition, with a kappa of 0.81. Among mental health conditions, bipolar disorders showed the strongest reliability at 0.74, while nonaffective psychoses sat at the bottom with a kappa of 0.55. Anxiety and depression fell somewhere in the middle, with lower consistency than substance use disorders.
Why substance use disorders are diagnosed more reliably
Professor Duncan attributed the higher reliability for substance use disorders to the nature of the diagnostic criteria. “Substance use disorder criteria are largely based on behaviour,” she said. “For instance, it’s often easier to estimate how many drinks you had in a week, than the number of days you felt sad or anxious.” In other words, disorders that rely on observable actions or quantifiable patterns, such as frequency of use, tend to yield more consistent results than those that depend on a patient’s subjective recall of mood, emotions, or internal experiences. This distinction may help explain why interviews for depression, anxiety, and personality disorders are more prone to variation between sessions.
The review included papers on several widely used diagnostic tools. Among them were the Structured Clinical Interview for DSM-5 (SCID), developed by Dr Michael First, a psychiatrist and professor at Columbia University; the Mini International Neuropsychiatric Interview (MINI), a brief structured interview translated into over 70 languages and shown to have good validity and reliability comparable to the SCID; and the Clinician-Administered PTSD Scale (CAPS), which has demonstrated excellent test-retest reliability, with severity score coefficients ranging from 0.90 to 0.98.
Expert critiques of the study’s approach
Dr First, who was heavily involved in developing the diagnostic criteria for DSM-IV-TR and served as a consultant on the DSM-5 and ICD-11 revisions, expressed frustration with elements of the meta-analysis. While he agreed that diagnostic interviews vary in reliability and too often fail to correctly diagnose people, he said the study did not provide enough specific information for clinicians to choose between instruments. “It’d be nice to be able to look at this and say: ‘Oh, based upon this paper, I should pick this one because of this.’ That would be doing the field a real service,” he said. “But there’s simply not enough information here.”
Professor Duncan acknowledged the limitation, noting that the findings were based on the amount of relevant research available during the study period, which ran from February 2024 to September 2025. She added that the research team “attempted to extract information on interview format, but this was often unclear or not reported”. The lack of data needed to compare individual instruments is itself, she said, “another sign of the need for more rigor when it comes to psychiatric diagnosis”.
Dr First also took issue with the way the study lumped together fully structured and semi-structured interviews. Fully structured interviews, he explained, are designed to be administered by people with little training, typically for epidemiological research on large populations. Little training is needed, he said, “because you stick to the script and cannot deviate from it at all. If the person says something contradictory, you’re not allowed to even point out that it’s contradictory.” This rigidity tends to produce more consistent results across repeated administrations.
Semi-structured interviews, in contrast, are designed for trained clinicians. They allow the provider to “ad-lib their questions as needed”, Dr First said, meaning vague or contradictory answers can be clarified with follow-up questions. While this flexibility allows for more accurate diagnosis, it also introduces more variation between sessions. Dr First noted that fully structured interviews are therefore more likely to yield the same result when administered more than once, but the study’s grouping of both types may have obscured important differences.
Professor Duncan said that while it would be useful to address all of Dr First’s concerns, the data simply does not exist yet to compare interview formats in that level of detail.
Future directions: from categories to spectra
Both experts agreed that current diagnostic tools are far from ideal. Dr First said psychiatrists have been hoping for more objective laboratory tests for mental conditions “for 50 years”. Meanwhile, Columbia University researchers are exploring the use of machine learning and artificial intelligence to analyse electronic health records for early detection of mental illness, aiming for faster, more accurate, and more equitable diagnoses. Dr First’s own lab at Columbia is specifically dedicated to developing structured interviews and assessment tools.
Professor Duncan pointed to an alternative future approach: “move away from strict diagnostic categories, where a condition is either present or absent, and think about symptoms on a spectrum or continuum”. This spectrum approach, she argued, acknowledges that mental health conditions exist on a continuum with gradations in symptom type or severity in the general population, offering a more nuanced understanding than traditional categorical diagnoses. DSM-III, published in 1980, unified theoretical approaches by focusing on observable signs and symptoms and set the stage for today’s structured interviews, but the limits of that framework are now becoming apparent.
The research briefing did not disclose how the meta-analysis itself was funded, though two authors of one of the included studies disclosed ties to the pharmaceutical industry.
Inconsistent diagnoses carry real consequences: over- or under-treatment, delayed care, or inappropriate interventions. The study’s authors caution against relying on a single diagnostic interview and call for improved tools and more rigorous reporting of interview format in future research.
