Doctors experimenting with AI tools to help diagnose patients is nothing new. But getting them to trust the AIs they’re using is another matter entirely.
To establish that trust, researchers at Cornell University attempted to create a more transparent AI system that works by counseling doctors in the same way a human colleague would — that is, arguing over what the medical literature says.
Their resulting study, which will be presented at the Association for Computing Machinery Conference on Human Factors in Computing Systems later this month, found that how a medical AI works isn’t nearly as important to earning a doctor’s trust as the sources it cites in its suggestions.
“A doctor’s primary job is not to learn how AI works,” said Qian Yang, an assistant professor of information science at Cornell who led the study, in a press release. “If we can build systems that help validate AI suggestions based on clinical trial results and journal articles, which are trustworthy information for doctors, then we can help them understand whether the AI is likely to be right or wrong for each specific case.”
After interviewing and surveying a group of twelve doctors and clinical librarians, the researchers found that when these medical experts disagree on what to do next, they turn to the relevant biomedical research and weigh its merits. The researchers’ system, therefore, aims to emulate this process.
“We built a system that basically tries to recreate the interpersonal communication that we observed when the doctors give suggestions to each other, and fetches the same kind of evidence from clinical literature to support the AI’s suggestion,” Yang said.
The AI tool Yang’s team created is based on GPT-3, an older OpenAI large language model and a predecessor of the models behind ChatGPT. The tool’s interface is fairly straightforward. One side shows the AI’s suggestions; the other sets them alongside relevant biomedical literature the AI retrieved, with brief summaries of each study and other useful details such as patient outcomes.
So far, the team has developed versions of its tool for three medical specializations: neurology, psychiatry, and palliative care. When the doctors tried the versions tailored to their respective fields, they told the researchers that they liked the presentation of the medical literature, and affirmed that they preferred it to an explanation of how the AI worked.
While the feedback sounds promising, the study surveyed the opinions of only a dozen experts, a sample so small that its findings are unlikely to generalize.
Either way, this specialized AI seems to be faring better than ChatGPT’s attempt at playing doctor in a larger study, which found that 60 percent of its answers to real medical scenarios either disagreed with human experts’ opinions or were too irrelevant to be helpful.
But the jury is still out on how the Cornell researchers’ AI would hold up when subjected to a similar analysis.
Overall, while these tools may be helpful to doctors who have years of expertise to inform their decisions, we’re still a very long way from an “AI medical advisor” that can replace them.