JMLA - Evaluating a large language model's ability to answer clinicians' requests for evidence summaries - Jan. 2025
Results: Of the 216 evaluated questions, aiChat’s response was assessed as “correct” for 180 (83.3%) questions, “partially correct” for 35 (16.2%) questions, and “incorrect” for 1 (0.5%) question. Overall, the performance of the generative AI tool was promising. However, many of the included references could not be independently verified, and no attempt was made to assess whether any additional concepts introduced by aiChat were factually accurate.
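The percentages reported above follow directly from the three counts. A minimal Python sketch of that arithmetic is given here as a check; the counts are taken from the Results, while the variable and function names are illustrative assumptions and not part of the study.

```python
# Counts as reported in the Results paragraph.
# The names `counts` and `percentages` are hypothetical, chosen for illustration.
counts = {"correct": 180, "partially correct": 35, "incorrect": 1}


def percentages(counts):
    """Return each category's share of the total, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total = sum(counts.values())
shares = percentages(counts)

for label, pct in shares.items():
    print(f"{label}: {counts[label]}/{total} = {pct}%")
```

The three categories account for all 216 evaluated questions, so the shares are computed against the same denominator.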