Extended Data Table 2: Summary of the axes along which clinicians evaluate answers in our consumer medical question answering datasets

From: Large language models encode clinical knowledge

  1. The axes include agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or missing content, and possibility of bias in the answer. We use a panel of clinicians to evaluate the quality of model-generated and human-generated answers along these axes.