In A New Study, GPT-4 Outperformed Physicians In Clinical Reasoning, But Was Also Wrong More Often

In a new study, scientists at Beth Israel Deaconess Medical Center (BIDMC) compared a large language model’s clinical reasoning capabilities against those of human physicians. The investigators used the revised-IDEA (r-IDEA) score, a commonly used tool for assessing clinical reasoning.

The study gave a GPT-4-powered chatbot, 21 attending physicians, and 18 resident physicians 20 clinical cases to work through and establish diagnostic reasoning for. All three sets of answers were then evaluated using the r-IDEA score. The investigators found that the chatbot earned the highest r-IDEA scores, an impressive result with regard to diagnostic reasoning. However, the authors also noted that the chatbot was “just plain wrong” more often.

Stephanie Cabral, M.D., the lead author of the study, explained that “further studies are needed to determine how LLMs can best be integrated into clinical practice, but even now, they could be useful as a checkpoint, helping us make sure we don’t miss something.” In short, the results showed sound reasoning by the chatbot but also significant mistakes; this further bolsters the idea that these AI-powered systems are best suited (at least at their current maturity levels) as tools to augment a physician’s practice, rather than replace a physician’s diagnostic capabilities.

As physician leaders and technologists alike often explain, this is because the practice of medicine is not purely based on the algorithmic outputs of rules, but rather on a deep sense of reasoning and clinical intuition, which is challenging for an LLM to replicate. Nevertheless, tools like these that provide diagnostic or clinical support can still be an incredibly powerful asset in the physician workflow. For example, if systems can reasonably provide a “first-pass” or initial diagnosis suggestion based on available data such as the patient history or existing records, physicians may save a significant amount of time in their diagnostic process. Furthermore, if these tools can augment a physician’s workflow and improve their ability to process large amounts of clinical information from the medical record, there may be opportunities to increase efficiency.

Many organizations are taking advantage of these potential means of clinical augmentation. For example, AI-powered scribing technologies are leveraging natural language processing to help physicians complete clinical documentation more efficiently. Enterprise search tools are being integrated within organizations and with EMR systems to help physicians search large swaths of data, promote data interoperability, and glean quicker and deeper insights from existing patient data. Other systems may even help offer an initial diagnosis; for example, tools are emerging in the fields of radiology and dermatology that are able to suggest a potential diagnosis by analyzing an uploaded photo.

Nevertheless, there is still a lot of work to be done in this arena. Simply put, although AI systems such as these are not ready for clinical diagnostics, there may still be an opportunity to leverage this technology to augment clinical workflows, especially while keeping a human in the loop to ensure safe, secure, and accurate processes.
