Gradient-Based Saliency Maps Are Not Trustworthy Visual Explanations of Automated AI Musculoskeletal Diagnoses

Published in: Journal of Imaging Informatics in Medicine

Abstract

Saliency maps are widely used to “explain” decisions made by modern machine learning models, including deep convolutional neural networks (DCNNs). While the resulting heatmaps purportedly indicate important image features, their “trustworthiness,” i.e., utility and robustness, has not been evaluated for musculoskeletal imaging. The purpose of this study was to systematically evaluate the trustworthiness of saliency maps used in disease diagnosis on upper extremity X-ray images. The underlying DCNNs were trained using the Stanford MURA dataset. We studied four trustworthiness criteria—(1) localization accuracy of abnormalities, (2) repeatability, (3) reproducibility, and (4) sensitivity to underlying DCNN weights—across six different gradient-based saliency methods (Grad-CAM (GCAM), gradient explanation (GRAD), integrated gradients (IG), SmoothGrad (SG), smooth IG (SIG), and XRAI). Ground truth was defined by the consensus of three fellowship-trained musculoskeletal radiologists who each placed bounding boxes around abnormalities on a holdout saliency test set. Compared to radiologists, all saliency methods showed inferior localization (AUPRCs: 0.438 (SG)–0.590 (XRAI); average radiologist AUPRC: 0.816), repeatability (IoUs: 0.427 (SG)–0.551 (IG); average radiologist IoU: 0.613), and reproducibility (IoUs: 0.250 (SG)–0.502 (XRAI); average radiologist IoU: 0.613) on abnormalities such as fractures, orthopedic hardware insertions, and arthritis. Five methods (GCAM, GRAD, IG, SG, XRAI) passed the sensitivity test. Ultimately, no saliency method met all four trustworthiness criteria; therefore, we recommend caution and rigorous evaluation of saliency maps prior to their clinical use.
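The repeatability and reproducibility criteria above compare pairs of saliency maps via intersection-over-union (IoU) of their most salient regions. A minimal sketch of that comparison, assuming maps are binarized by keeping a top fraction of pixels (the threshold fraction and function names here are illustrative, not the study's exact protocol):

```python
import numpy as np

def binarize(saliency, top_fraction=0.05):
    """Keep the top `top_fraction` most salient pixels as a binary mask."""
    thresh = np.quantile(saliency, 1.0 - top_fraction)
    return saliency >= thresh

def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union else 0.0

# Toy example: compare a saliency map against a slightly perturbed copy,
# mimicking maps produced by two retrained copies of the same DCNN.
rng = np.random.default_rng(0)
map_run1 = rng.random((224, 224))
map_run2 = map_run1 + 0.1 * rng.random((224, 224))
score = iou(binarize(map_run1), binarize(map_run2))
```

Under a scheme like this, identical maps score IoU 1.0 and fully disjoint salient regions score 0.0; the study reports method IoUs of 0.250–0.551 against an average radiologist IoU of 0.613.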


Code and Data Availability

We make our code repository available upon publication at https://github.com/kvenkatesh5/saliency-trustworthiness. All imaging data used are in the public domain and available via the references provided in the manuscript text.


Funding

This work was supported by the Johns Hopkins Biomedical Engineering Leong Undergraduate Research Fund. J.S. is supported by NSF CAREER Award CCF 2239787.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by K.V. The first draft of the manuscript was written by K.V. and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Paul H. Yi.

Ethics declarations

Ethics Approval

This study used publicly available data and did not constitute human subjects research; therefore, institutional review board review was waived.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 117 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Venkatesh, K., Mutasa, S., Moore, F. et al. Gradient-Based Saliency Maps Are Not Trustworthy Visual Explanations of Automated AI Musculoskeletal Diagnoses. J. Imaging Inform. Med. (2024). https://doi.org/10.1007/s10278-024-01136-4


  • DOI: https://doi.org/10.1007/s10278-024-01136-4
