Discussion and limitations
While g-AMIE is able to follow guardrails in the vast majority of cases, there are caveats and nuances in classifying individualized medical advice. Our results are based on a single rating per case, even though we observed significant disagreement among raters in previous studies. Moreover, the comparison to both control groups should not be taken as commentary on their ability to follow the supplied guardrails; PCPs in particular are not used to withholding medical advice in consultations. Considerable further development of AI oversight paradigms in real-world settings is required to ensure generalisation of our proposed framework.
While g-AMIE’s SOAP notes included confabulations in a few cases, we found that such confabulations occurred at a rate similar to that of misremembering by both guardrail PCPs and guardrail NP/PAs. It is noteworthy, however, that g-AMIE’s notes are considerably more verbose, which leads to longer oversight times and a higher rate of edits focused on reducing verbosity. In interviews with overseeing PCPs, we also found that oversight is mentally demanding, which is consistent with prior work on the cognitive load of AI-assisted decision support systems.
On the other hand, we believe that during history taking this verbosity contributes to g-AMIE’s higher ratings for how information is explained and rapport is built. Patient actors and independent physicians preferred g-AMIE’s patient messages and its demonstration of patient empathy. These findings highlight the need for future work on finding the right trade-off in verbosity between history taking, medical notes and patient messages.
We also found that NPs and PAs consistently outperformed PCPs in history taking quality, guardrail adherence and diagnostic quality. However, these differences should not be extrapolated into meaningful indicators of relative performance in the real world. The tested workflow was designed to explore a paradigm of AI oversight, and both control groups are provided primarily to contextualize g-AMIE’s performance. Neither group received specific training for this workflow, and the workflow does not account for several real-world professional needs. Our results therefore likely underestimate clinicians’ capabilities significantly. Moreover, the recruited NPs and PAs had more experience and may be more familiar with patient-focused history taking. PCPs, in contrast, are taught to explicitly link history taking to the diagnostic process, tying questions to direct hypothesis testing, and the proposed workflow is likely to have significantly affected their consultation performance.
Finally, patient actors cannot act as an exact substitute for real patients, especially in combination with our 60 constructed scenario packs. While these scenario packs cover a range of conditions and demographics, they are not representative of real clinical practice.