Extended Data Table 2 Detailed comparison between human clinical decisions and AI predictions

From: International evaluation of an AI system for breast cancer screening

a, Comparison of sensitivity and specificity between human benchmarks (derived retrospectively from the clinical record) and the predictions of the AI system. Score thresholds were chosen, on the basis of separate validation data, to match or exceed the performance of each human benchmark (see Methods section ‘Selection of operating points’). These points are depicted graphically in Fig. 2. Note that the number of cases (N) differs from Fig. 2 because the opinion of the radiologist was not available for all images. We also note that these sensitivity and specificity metrics are not easily comparable to those in most previous publications in breast imaging (for example, the DMIST trial; ref. 34), given the differences in follow-up interval. Negative cases in the US dataset were upweighted to account for the sampling protocol (see Methods section ‘Inverse probability weighting’).

b, Same columns as a, but using a version of the AI system that was trained exclusively on the UK dataset and tested on the US dataset, to show the generalizability of the AI system across populations and healthcare systems. Superiority comparisons on the UK data were conducted using Obuchowski’s extension of the two-sided McNemar test for clustered data; non-inferiority comparisons were Wald tests using the Obuchowski correction. Comparisons on the US data were performed with a two-sided permutation test. All P values survived correction for multiple comparisons (see Methods section ‘Statistical analysis’). Quantities in bold represent estimated differences that are statistically significant for superiority; all others are statistically non-inferior at a pre-specified 5% margin.
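The upweighting of negative cases mentioned in a can be illustrated with a minimal inverse-probability-weighting sketch. This is not the study's code; the function name and the idea of passing a per-case sampling probability are assumptions for illustration. Each negative case contributes with weight 1/p, the inverse of the probability with which it was sampled into the test set, so that the weighted specificity estimates the specificity in the full screening population:

```python
import numpy as np

def weighted_specificity(decisions, truth, sampling_prob):
    """Inverse-probability-weighted specificity (illustrative sketch).

    decisions:     binary recall decisions (1 = recalled)
    truth:         binary outcomes (1 = cancer)
    sampling_prob: per-case probability of inclusion in the test set;
                   undersampled negatives get prob < 1 and hence weight > 1
    """
    neg = truth == 0
    w = 1.0 / sampling_prob[neg]                # inverse-probability weights
    correct = (decisions[neg] == 0).astype(float)  # true negatives
    return float(np.sum(w * correct) / np.sum(w))
```

With uniform sampling probabilities this reduces to the ordinary (unweighted) specificity.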
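The two-sided permutation test used for the US comparisons can be sketched as follows. This is a generic paired-label-swap permutation test on sensitivity, not the study's implementation (which also incorporated the inverse probability weights); the function name, the restriction to sensitivity, and the number of permutations are assumptions for illustration. Under the null hypothesis that reader and AI perform equally, the two decisions for each case are exchangeable, so swapping them at random generates the null distribution of the difference:

```python
import numpy as np

def paired_permutation_test(reader, ai, truth, n_perm=10_000, seed=0):
    """Two-sided permutation test for the difference in sensitivity
    between paired binary decisions (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pos = truth == 1                      # sensitivity: positive cases only
    r = reader[pos].astype(float)
    a = ai[pos].astype(float)
    observed = a.mean() - r.mean()        # observed difference in sensitivity
    extreme = 0
    for _ in range(n_perm):
        swap = rng.random(r.size) < 0.5   # swap each reader/AI pair with prob 1/2
        rp = np.where(swap, a, r)
        ap = np.where(swap, r, a)
        extreme += abs(ap.mean() - rp.mean()) >= abs(observed)
    # add-one correction keeps the p value strictly positive
    return observed, (extreme + 1) / (n_perm + 1)
```

The same machinery applies to specificity by restricting to negative cases instead.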