Original Research | Open Access

Deep Learning Detection of Active Pulmonary Tuberculosis at Chest Radiography Matched the Clinical Performance of Radiologists

Published Online:https://doi.org/10.1148/radiol.212213

Abstract

Background

The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions.

Purpose

To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists.

Materials and Methods

A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning (“noisy-student”) were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists. The authors studied the efficacy of the DLS compared with that of nine radiologists using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity.

Results

A total of 165 754 images in 22 284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than those for all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point, the DLS sensitivity was higher (88% vs 75%, P < .001) and specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa data set, and across various TB-specific chest radiograph findings. In simulations, the use of the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%–80% per TB-positive patient detected.

Conclusion

A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs.

Published under a CC BY 4.0 license.

Online supplemental material is available for this article.

See also the editorial by van Ginneken in this issue.

Summary

A deep learning system trained to detect active pulmonary tuberculosis using de-identified data from 10 countries had a clinical performance comparable to that of nine radiologists on test data sets.

Key Results

  ■ A deep learning system (DLS) to detect active pulmonary tuberculosis (TB) was trained using chest radiographs from 22 284 subjects.

  ■ In a four-country test set (1236 subjects, 17% with active TB), the area under the receiver operating characteristic curve of the DLS was 0.89 (95% CI: 0.87, 0.91).

  ■ For the test set (1236 subjects), the DLS achieved superiority in sensitivity (88% vs 75%, P < .001) and noninferiority in specificity (79% vs 84%, P = .004) compared with nine radiologists.

Introduction

Globally, one in four people is infected with Mycobacterium tuberculosis, and 5%–10% of these individuals will develop active tuberculosis (TB) during their lifetime (1,2). In 2019, the estimated worldwide mortality from TB was 1.4 million (3). Almost 90% of active TB infections occur in approximately 30 “high-burden” countries, many of which lack the resources needed to address this public health problem (4). The COVID-19 pandemic has also disrupted efforts to combat TB: 21% (1.4 million) fewer people received care for TB in 2020 than in 2019 (3).

In the past decade, there has been steady global support to combat this health crisis through the “End TB Strategy” of the World Health Organization (WHO), the Sustainable Development Goals of the United Nations, and the Global Fund to Fight AIDS, Tuberculosis and Malaria (5). Cost-effective pulmonary TB screening using chest radiography has the potential to increase equity in access to health care, particularly in difficult-to-reach populations (6). There has been active research into using artificial intelligence to screen chest radiographs, followed by confirmatory nucleic acid amplification testing (NAAT) (7–10). These workflows have been shown to be cost-effective compared with NAAT alone and to substantially increase patient throughput (8). As part of its recently published 2021 guidance, the WHO evaluated three independent computer-aided detection software systems and determined that their diagnostic accuracy and performance were similar to those of human readers (6,8,11,12). Given the scarcity of experienced readers, the WHO now recommends computer-aided detection for both screening and triage in individuals aged 15 years or older (6).

In this study, we developed a deep learning system (DLS) to interpret chest radiographs for imaging features of active pulmonary TB. We tested our DLS using an aggregate of data sets from China, India, the United States, and Zambia that together reflect different regions, races, ethnicities, and local disease prevalence. We evaluated the DLS under two conditions: (a) having a single prespecified operating point across all data sets and (b) when customized to radiologist performance in each locale. Because diagnostic performance may be influenced by disease prevalence, we compared the DLS with two different groups of radiologists: one based in a TB-endemic region (India) and one based in a TB-nonendemic region (the United States). Finally, we estimated cost savings for using this DLS as a triaging solution for NAAT in screening settings. We designed our study to model real-world deployment scenarios and evaluated generalizability in four resource-limited, high-TB-burden areas.

Materials and Methods

Data Sets

This retrospective study leveraged data from multiple sources. Data were collected with the participant’s consent and/or de-identified in accordance with local regulatory requirements and/or reviewed by the institution’s ethics committee or institutional review board before our receipt of the data set. The study adhered to the tenets of the Declaration of Helsinki. We leveraged de-identified chest radiograph data sets spanning nine countries for training and five countries for testing the DLS (a four-country combined test set and an additional test set of miners from South Africa), for a total of 10 countries (Tables 1, 2). Data were acquired between 1996 and 2020. DLS training used 160 187 images from Europe (13) (Azerbaijan, Belarus, Georgia, Moldova, Romania), India, and South Africa, and tuning used 3258 images from China (14–17), India, and Zambia (18). In addition, we used 550 297 images (19,20) for pretraining purposes (10 310 of which overlapped with the training and validation sets and none of which overlapped with the test sets) and 138 images with labeled lung segmentation masks from the United States data set for training and validating the lung cropping model. The trained DLS was tested on 1236 images from China (14,15), India, the United States (14,15), and Zambia and 1073 images from South Africa, using one image per patient for all (2309 test images). Additional details including inclusion and exclusion criteria, enrichment, and reference standard are presented in Figures 1 and E1 (online) and Tables 1 and 2. This study was funded by Google.

Table 1: Baseline Characteristics of Training and Validation Data Sets

Table 2: Baseline Characteristics of Test Data Sets

Figure 1: (A) Standards for Reporting of Diagnostic Accuracy Studies, or STARD, diagrams for validation and test data sets for Zambia, India, China, and the United States. (B) Standards for Reporting of Diagnostic Accuracy Studies diagram for validation and test data sets for South Africa. CAD = computer-aided detection, CIDRZ = Centre for Infectious Disease Research in Zambia, CXR = chest radiograph, GXP = GeneXpert (Cepheid), DICOM = Digital Imaging and Communications in Medicine, ID = identification, QA = quality assurance, TB = tuberculosis, TB– = TB negative, TB+ = TB positive. CAD4TB is a commercially available software (Delft Imaging).

Reference Standard for TB Status

For all validation and testing data sets, positive TB status was confirmed by means of microbiological testing (sputum culture or sputum smear) or NAAT (eg, GeneXpert MTB/RIF, Cepheid), or by mixed clinical criteria in the case of the United States (Montgomery, Md) and China (Shenzhen), two well-known TB data sets (Tables 1, 2, and E1 [online]). In the training data sets, the reference standard varied due to site-specific practice differences and data availability, and included microbiological testing, radiologist interpretation, clinical diagnoses (based on medical history and imaging), and NAAT. Additional information is provided in Appendix E1 (online).

DLS Description

Our DLS was developed to detect active pulmonary TB on chest radiographs and consists of three modules: a lung cropping model for identifying a bounding box spanning the lungs, a detection model for identifying regions containing possible imaging features of active TB (nodules, airspace opacities with cavitation, airspace opacities without cavitation, pleural effusion, granulomas, and fibroproductive lung opacities), and a classification model that takes the output from both the lung cropping model and the detection model to predict the likelihood of the chest radiograph being TB positive as well as whether there were any non-TB abnormalities (Fig 2). Further details on the model and training process are provided in Appendix E1 (online).
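The three-module design described above can be illustrated with a minimal sketch in Python (the language used for the study’s analyses). The function names, interfaces, and placeholder logic below are hypothetical illustrations; the actual modules are trained neural networks whose details are given in Appendix E1 (online).

```python
import numpy as np

def crop_lungs(image: np.ndarray, bbox: tuple) -> np.ndarray:
    """Module 1 (hypothetical interface): crop the radiograph to the lung
    bounding box produced by the lung cropping model."""
    top, left, bottom, right = bbox
    return image[top:bottom, left:right]

def classify(cropped: np.ndarray, region_scores: np.ndarray) -> dict:
    """Module 3 (placeholder logic): combine the cropped image with
    region-level scores from the detection model (module 2) to output
    probabilities for TB and for non-TB abnormality."""
    p_tb = float(region_scores.max())  # stand-in for a trained classifier
    return {"P(tuberculosis)": p_tb, "P(abnormal)": max(p_tb, 0.1)}

# Toy end-to-end run with a synthetic image and synthetic detection output.
image = np.zeros((1024, 1024), dtype=np.float32)
cropped = crop_lungs(image, (100, 50, 900, 980))
scores = classify(cropped, region_scores=np.array([0.2, 0.7, 0.45]))
is_positive = scores["P(tuberculosis)"] >= 0.45  # prespecified operating point
```

In the real system, the classification model ingests both the cropped image and the detection model’s region proposals; the thresholding at 0.45 shown in the last line is the operating point actually prespecified in the study.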

Figure 2: Overview of our deep learning system. The system consists of three modules: a lung cropping model to specifically crop the lungs, a detection model to identify regions of interest, and a classification model that takes the output from the other two models to predict the likelihood of the chest radiograph (CXR) being tuberculosis positive. The large-scale abnormality pretraining and noisy student semisupervised learning used to train the classification model are not shown here. “P” indicates probability, for example P(Tuberculosis) indicates the probability of the image showing signs of tuberculosis. FC = fully connected, RCNN = Region-based Convolutional Neural Network.

Comparator Radiologist Reviews

To contextualize the DLS performance across data sets of varying levels of difficulty, all test set images were reviewed by radiologists. To examine the extent to which radiologists accustomed to practicing in endemic versus nonendemic settings vary, these radiologists included two groups (10 India-based consultant radiologists and five U.S.-based board-certified radiologists). All India-based radiologists had experience in reading TB images; among the U.S.-based radiologists, one had fellowship training in body and musculoskeletal imaging and one had fellowship training in pediatric radiology. Data from one radiologist from India were excluded due to low sensitivity; this exclusion is described in more detail in the Results section. The India-based radiologists had an average of 6 years of experience (range, 3–9 years), and the U.S.-based radiologists had an average of 11 years of experience (range, 3–22 years). The radiologists were provided with both the image and additional clinical information about the patient when available (age, sex, symptoms, and HIV status); clinical information was not incorporated into the DLS. The radiologists labeled each of the 2309 images in the test data sets for the presence or absence of TB, their confidence (on a scale from 0 to 100) when they indicated the presence of TB, and whether there were any minor technical issues visible on the image. The validation data sets were labeled similarly.

Statistical Analysis

Our prespecified primary analyses compared the performance of the DLS with that of the India-based radiologists on the combined four-country test data set. For comparison with radiologists, we thresholded the DLS continuous score using an operating point of 0.45, chosen using the validation data sets and before evaluating the DLS on any of the test sets. We tested for noninferiority of sensitivity and specificity, both with a 10% margin prespecified based on sample size simulations on the validation set. To account for correlations within patient and within radiologist, we used the Obuchowski-Rockette-Hillis procedure (21,22) configured for binary data (23) and adapted to compare readers with the standalone algorithm (24) in a noninferiority setting (25). P < .0125 was considered indicative of a significant difference for the primary analyses (a conservative one-sided α of .025 was halved for a Bonferroni correction for two tests). Subsequent superiority testing was prespecified if noninferiority was met, which does not require multiple testing correction (26).

Prespecified secondary analyses included per–data set subgroup analysis for receiver operating characteristic (ROC) curves; sensitivity and specificity at the prespecified operating point; operating points corresponding to the WHO target sensitivity and specificity; matched sensitivity and specificity analysis on a per–data set and per-radiologist level; and comparisons of the India-based and U.S.-based radiologists and comparison of the DLS to the U.S.-based radiologists. Additional secondary analyses were performed in subgroups based on HIV status, images flagged by the reviewing radiologists to have minor technical issues, demographic information, and symptoms. An exploratory subgroup analysis based on sputum smear status (27) was also conducted. Unless otherwise specified, 95% CIs were calculated using the bootstrap method with 1000 samples. We evaluated the performance of the DLS and radiologists on 247 randomly chosen TB-positive chest radiographs (the entirety of positive images from the four-country test set and a sample from South Africa) stratified by the presence of TB-related radiologic findings on chest radiographs. TB-related findings were labeled by a U.S.-based radiologist (C.L., with 17 years of post-residency experience). Due to small sample sizes, P values were not calculated in this smaller subgroup analysis. Finally, we conducted a simulated cost analysis (Appendix E1 [online]). All analyses were conducted using Python version 3.7.10 and the libraries numpy v1.19.5, scipy v1.2.1, sklearn v0.24.1, and pandas v1.1.5.
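The core of the comparison — thresholding the continuous DLS score at the prespecified operating point and computing bootstrap CIs over resampled subjects — can be sketched as follows. The data here are synthetic and the exact resampling scheme (eg, any stratification) is an assumption; this is not the study’s analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sens_spec(y_true, y_score, threshold=0.45):
    """Sensitivity and specificity after thresholding a continuous score
    at the prespecified operating point."""
    y_pred = y_score >= threshold
    sens = np.mean(y_pred[y_true == 1])
    spec = np.mean(~y_pred[y_true == 0])
    return sens, spec

def bootstrap_ci(y_true, y_score, metric_idx, n_boot=1000, alpha=0.05):
    """Nonparametric percentile bootstrap CI, resampling subjects with
    replacement (1000 samples, as in the study)."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        stats.append(sens_spec(y_true[idx], y_score[idx])[metric_idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic illustration at 17% prevalence, as in the four-country test set.
y_true = (rng.random(1236) < 0.17).astype(int)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, 1236), 0, 1)
sens, spec = sens_spec(y_true, y_score)
sens_ci = bootstrap_ci(y_true, y_score, metric_idx=0)
```

The Obuchowski-Rockette-Hillis procedure used for the reader-versus-algorithm hypothesis tests additionally models within-patient and within-reader correlation and is substantially more involved than this sketch.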

Results

Data Set Characteristics

We had two test sets. Our four-country (China, India, the United States, and Zambia) test set included 1236 patients with a mean age of 38 years ± 20 (SD) (483 female subjects [39%]), and our one-country (South Africa) test data set included 1073 patients with a mean age of 43 years ± 8 (16 female subjects [1%]). DLS performance was first evaluated on the combined four-country test data set, which incorporated diverse subjects representing multiple races and ethnicities (Tables 1, 2). Of the 1236 images (from 1236 subjects), 212 were from subjects positive for TB based on culture or NAAT. Patient sources for these four data sets included TB referral centers, outpatient clinics, and active case-finding programs. The India test data set was from a site independent of those used for training and validation of the model. An independent test data set from South Africa, comprising mostly male subjects from a gold mining population, served as an additional test set.

DLS Performance

In our combined four-country test data set, the DLS attained an area under the ROC curve (AUC) of 0.89 (95% CI: 0.87, 0.91) (Fig 3A). To contextualize the performance of the model, two groups of radiologists (from India [TB endemic] and the United States [TB not endemic]) reviewed the same images. One India-based radiologist was found to have a rate of flagging positives (and consequently sensitivity) substantially below the others (Fig E2 [online]) and so was excluded from subsequent analyses to avoid under-representing radiologist performance. The ROC curve of the DLS was above the sensitivity and specificity points of all nine remaining India-based radiologists (range of sensitivity increase at matching specificity: 1.9%–14.2%, P < .05 met for six of nine radiologists) (Fig 3A).

Figure 3: Receiver operating characteristic (ROC) curves for the deep learning system (DLS) compared with radiologists on (A) a combined data set comprising four countries and (B) each data set individually. ROC curves for the DLS compared with radiologists on (C) subgroups based on HIV status in the Zambia data set and (D) an additional test data set from a mining population in South Africa. AUC = area under the ROC curve.

Our prespecified primary analyses involved comparing the DLS at a prespecified operating point with India-based radiologists (Table 3). The sensitivity of the DLS (88%) was higher than that of the India-based radiologists (mean sensitivity, 75%; P < .001). At the same operating point, the specificity of the DLS (79%) was similarly noninferior to that of the India-based radiologists (mean specificity, 84%; P = .004). The distribution of the India-based radiologists’ sensitivity and specificity is shown in Figure E2 (online).

Table 3: Comparison of the Sensitivity and Specificity of the DLS and Radiologists

Comparison of India-based and U.S.-based Radiologists

Although both India-based and U.S.-based radiologists had sensitivities and specificities that tracked closely with, and slightly below, the ROC curve of our model (for U.S.-based radiologists, the range of sensitivity increase at matching specificity was 0.9%–15.6%, P < .05 met for four of five radiologists), the conservativeness with which the two groups of radiologists called images as positive for TB appeared to differ (Figs 3A–C, E2 [online]). India-based radiologists were more specific than U.S.-based radiologists (mean specificity, 84.0% vs 74.4%, respectively; P = .019, exact one-tailed permutation test) and trended toward lower sensitivity (mean sensitivity, 74.6% vs 77.9%; P = .299). The sensitivity and specificity of the DLS remained comparable to those of the U.S.-based radiologists (P value for noninferiority < .05 for both sensitivity and specificity) (Table 3).

Per–Data Set Analysis

Results of the detailed per-country subgroup analysis for the combined test data set are presented in Tables 3 and E7 (online) and Figure 3B and described in Appendix E1 (online). For the additional independent test data set from a mining population in South Africa (Fig 3D, Table E3 [online]), the ROC curve of the model was above the curves for all but one radiologist. At the same prespecified operating point as the other data sets, the DLS was noninferior in both sensitivity and specificity to both India-based and U.S.-based radiologists (P < .05 for all). At a higher-specificity operating point selected based on the South Africa validation data sets, the DLS was again noninferior in both sensitivity and specificity compared with the India-based radiologists but had higher specificity (P = .01) than the U.S.-based radiologists at the cost of not being noninferior in sensitivity (P = .59) (Table E3 [online]). When matching DLS sensitivity to mean radiologist sensitivity per data set, the DLS was noninferior to both India-based and U.S.-based radiologists (Table E7 [online]).

Inter–Data Set Comparisons

To better understand inter–data set differences, histograms of DLS prediction scores were plotted separately for images from subjects who were TB positive or negative (Fig 4). The distribution of DLS scores for both TB-positive and TB-negative images remained similar across the China, India, and U.S. data sets (Table E6 [online]). However, there was a higher proportion of TB-negative images with high DLS scores in the Zambia data set. This appears to have been a consequence of first-round computer-aided detection screening of the Zambia data set, which censored many normal-appearing chest radiographs, resulting in a more challenging data set with a relative paucity of normal chest radiographs.

Figure 4: Histograms show the distribution of deep learning system (DLS) predictions stratified by positive (red) versus negative (blue) examples to illustrate shifts across data sets.

Visualizing Challenging Images

We next inspected chest radiographs to better visualize the types of images that are challenging for the DLS. In Figure 5, we show images for which the DLS provided the correct interpretation. For images with confident predictions (where the DLS score was close to 1 or 0), there was high concordance with radiologists’ interpretations. For challenging predictions (where the DLS score was close to the operating point, 0.45), radiologist performance also decreased. In two challenging TB-positive images that the DLS predicted incorrectly (Fig 6), radiologist annotations indicated lymphadenopathy and miliary infection. In two challenging TB-negative images that the DLS predicted incorrectly, radiologist annotations indicated findings consistent with bronchiolitis and left lower lung infection.

Figure 5: Examples of chest radiographs for which the deep learning system (DLS) provided the correct interpretation, corresponding to (A) tuberculosis (TB)–positive subjects and (B) TB-negative subjects. Blue outlines encircle salient regions via Grad-CAM (35) that most influence the positive prediction from the DLS and are shown when the DLS considered the image positive. Yellow outlines were annotated by a radiologist to indicate regions of interest; solid outlines indicate findings consistent with TB while dotted outlines indicate other findings. “Confident” indicates that the DLS predicted values close to 0 or 1, whereas “challenging” indicates that the DLS predicted values close to the operating point (0.45). CXR = chest radiograph, rads = radiologists.

Figure 6: Examples of chest radiographs for which the deep learning system (DLS) provided the wrong interpretation, corresponding to (A) tuberculosis (TB)–positive subjects and (B) TB-negative subjects. Blue outlines encircle salient regions via Grad-CAM (35) that most influence the positive prediction from the DLS and are shown when the DLS considered the image positive. Yellow outlines were annotated by a radiologist to indicate regions of interest; solid outlines indicate findings consistent with TB while dotted outlines indicate other findings. “Confident” indicates that the DLS predicted values close to 0 or 1, whereas “challenging” indicates that the DLS predicted values close to the operating point (0.45). rads = Radiologists.

Matched Performance to Radiologists and WHO Targets

To facilitate comparisons despite the wide range in radiologist sensitivities and specificities, both across data sets and readers, we next conducted a matched analysis by shifting the DLS operating point on a per–data set level to (a) compare sensitivities at mean radiologist specificity and (b) compare specificities at mean radiologist sensitivity, as detailed in Tables 4 and E7 (online) as well as in Appendix E1 (online).

Table 4: Comparing Performance of the DLS to Radiologists after Matching to Mean Radiologist Specificity per Data Set

The WHO “target product profile” for a TB screening test recommends a sensitivity greater than 90% and a specificity greater than 70%. In matched performance analysis at 90% sensitivity, the DLS had a specificity of 77% on the combined data set; at 70% specificity, the DLS had a sensitivity of 93%. Both of these met the recommendations. This finding remained true in the China, India, and U.S. data sets but not in the enriched Zambia data set (Table 5).
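Reading the specificity achievable at a target sensitivity (and vice versa) off an empirical ROC curve amounts to scanning candidate thresholds. The sketch below illustrates this with synthetic scores; the score distributions are illustrative assumptions, not the study’s data.

```python
import numpy as np

rng = np.random.default_rng(7)

def roc_points(y_true, y_score):
    """Sensitivity and specificity at each candidate threshold, scanning
    from the highest score downward."""
    order = np.argsort(-y_score)
    y = y_true[order]
    tps = np.cumsum(y)        # true positives flagged so far
    fps = np.cumsum(1 - y)    # false positives flagged so far
    sens = tps / y.sum()
    spec = 1 - fps / (len(y) - y.sum())
    return sens, spec

# Synthetic scores with roughly the separation reported for the DLS.
y_true = (rng.random(2000) < 0.17).astype(int)
y_score = np.clip(0.31 * y_true + rng.normal(0.35, 0.18, 2000), 0, 1)
sens, spec = roc_points(y_true, y_score)

# WHO target product profile: sensitivity > 90%, specificity > 70%.
spec_at_90_sens = spec[sens >= 0.90].max()  # best specificity meeting the target
sens_at_70_spec = sens[spec >= 0.70].max()  # best sensitivity meeting the target
```

Because sensitivity is nondecreasing as the threshold is lowered, the first point meeting the sensitivity target gives the best attainable specificity, and symmetrically for the specificity target.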

Table 5: Model Performance at the WHO’s Target Sensitivity and Specificity Thresholds

Subgroups according to HIV Status

We next considered subgroups based on HIV status where available, which corresponded to most subjects in the Zambia data set. In the confirmed HIV-negative subgroup, the DLS had an AUC of 0.92 (95% CI: 0.85, 0.97). In the confirmed HIV-positive subgroup, the DLS had an AUC of 0.80 (95% CI: 0.79, 0.89). Similarly lower sensitivity and specificity were observed for radiologist assessments in the HIV-positive subgroup (Fig 3C). The DLS remained comparable to the radiologists in both subgroups, notably even though HIV status information was not integrated within the DLS. Additional subgroup analyses are presented in Figures E3–E9 (online).

Detecting Non-TB Pulmonary Findings

We further evaluated the “abnormality” detector in the DLS on the India and Zambia test sets, using labels provided by three India-based radiologists as the reference standard (Appendix E1 [online]). A positive image was defined as either being TB-positive or having another abnormal chest radiograph finding. Depending on how many radiologists indicated the presence of the abnormality, the AUC of the DLS ranged from 0.93 to 0.97 (Fig E6 [online]).
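The dependence of the AUC on the consensus level used for the reference standard can be reproduced schematically. The reader agreement rates and score distributions below are invented for illustration; only the evaluation pattern (reference standard defined as ≥ k of 3 readers flagging an abnormality) mirrors the study.

```python
import numpy as np

rng = np.random.default_rng(3)

def auc_mann_whitney(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney) formulation: the fraction of
    (positive, negative) pairs ordered correctly, with ties counting half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Three simulated radiologist reads (1 = abnormal) and a DLS abnormality score.
n = 500
truth = (rng.random(n) < 0.4).astype(int)
reads = np.array([(truth ^ (rng.random(n) < 0.1)).astype(int) for _ in range(3)])
dls_score = np.clip(0.5 * truth + rng.normal(0.25, 0.15, n), 0, 1)

# Reference standard at each consensus level: >=1, >=2, or all 3 readers.
aucs = {k: auc_mann_whitney((reads.sum(axis=0) >= k).astype(int), dls_score)
        for k in (1, 2, 3)}
```

Varying k between 1 and 3 shifts which borderline images count as positive, which is why the study reports an AUC range (0.93–0.97) rather than a single value.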

Stratification according to TB-associated Radiologic Findings

We found that our DLS had high sensitivity (range, 89%–100%) for all TB-associated radiologic abnormalities (Table 6). All DLS sensitivities were higher than the mean India-based and U.S.-based radiologist sensitivities. For images without any of our 10 TB-specific findings, sensitivity was lower for all: 33% for the DLS, a mean of 25% ± 8 for the India-based radiologists, and a mean of 45% ± 15 for the U.S.-based radiologists.

Table 6: Efficacy for Various Signs for the Diagnosis of Mycobacterium tuberculosis by the DLS in a Subgroup of 247 Subjects with a Positive Nucleic Acid Amplification Test

Cost Analysis

Finally, in our analysis of potential cost savings, we simulated a workflow in which subjects proceed to NAAT only if flagged as positive by the DLS, as detailed in Appendix E1 (online). On the basis of the performance in the India data set, as prevalence decreased from 10% to 1%, the cost per TB-positive patient detected increased substantially, but the cost savings compared with use of NAAT alone increased from 73% to 82% (Fig 7). The corresponding cost savings at low prevalence were less pronounced when simulating the WHO target (range, 47%–53%) and a lower-specificity device (range, 42%–48%).
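The two-stage triage arithmetic can be approximated with a short sketch using the unit costs stated in the Figure 7 caption. This assumes NAAT acts as a perfect confirmatory test and ignores second-order effects, so it reproduces the trend rather than the exact published figures in Appendix E1 (online).

```python
CXR_COST, NAAT_COST = 1.49, 13.06  # USD, per the cost analysis assumptions

def cost_per_tb_positive(prevalence, sens, spec):
    """Expected cost per TB-positive patient detected when NAAT is performed
    only for subjects flagged by the DLS (NAAT assumed perfectly accurate)."""
    flag_rate = prevalence * sens + (1 - prevalence) * (1 - spec)
    cost_per_subject = CXR_COST + flag_rate * NAAT_COST
    return cost_per_subject / (prevalence * sens)

def baseline_cost_per_tb_positive(prevalence):
    """Baseline workflow: all subjects proceed directly to NAAT."""
    return NAAT_COST / prevalence

# Sweep prevalence at the prespecified operating point's performance on the
# combined test set (88% sensitivity, 79% specificity).
savings = {}
for prev in (0.10, 0.05, 0.01):
    triage = cost_per_tb_positive(prev, sens=0.88, spec=0.79)
    savings[prev] = 1 - triage / baseline_cost_per_tb_positive(prev)
```

As prevalence falls, the baseline cost of testing every subject with NAAT grows faster than the triage cost, so the relative savings increase, matching the trend shown in Figure 7.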

Figure 7: Graph shows the estimated cost per tuberculosis (TB)–positive patient detected using the deep learning system (DLS). Absolute cost on the y-axis represents the expected cost per TB-positive patient detected. See the “Cost Analysis” section in Appendix E1 (online) for more details; briefly, the cost of chest radiography (CXR) was set at U.S. $1.49 and that of GeneXpert testing (Cepheid) at U.S. $13.06. In the “baseline” case, all subjects underwent GeneXpert testing. For the other three cases, all cases underwent chest radiography and only the subset flagged as positive by a DLS with a certain sensitivity and specificity (described in the inset) underwent GeneXpert testing. The numbers 90/65 and 90/70 indicate the sensitivity and specificity; for example, 90/70 indicates 90% sensitivity and 70% specificity. WHO = World Health Organization.

Discussion

To achieve the long-term public health vision of eliminating tuberculosis (TB) globally, there is a pressing need to scale up identification and treatment in resource-constrained settings. We developed a deep learning system (DLS) using data from nine countries and tested the DLS on data from five countries, together covering multiple countries with a high TB burden and a wide range of race and ethnicities as well as clinical settings. Our DLS, at the prespecified operating point, was noninferior to radiologists in identifying active TB on digital chest radiographs in both test sets, including a traditional test set composed of patients from four countries and a second test set from a mining population.

The development of a DLS with high clinical performance across a broad spectrum of patient settings has the potential to equip public health organizations and health care providers with a powerful tool to reduce inequities in efforts to screen and triage TB throughout the world. Many TB-prevalent areas have a shortage of radiologists, especially when screening is done in remote sites by community workers. Even when radiologists are available in another location, it can be difficult to have the image read quickly enough to inform patients before they leave; prompt results may improve patient adherence to subsequent follow-up (1,28). Although NAAT techniques have high positive predictive value, their implementation is limited by cost. However, if coupled with an inexpensive but relatively sensitive first-line filter like chest radiography (ie, only patients with positive findings at chest radiography are tested using NAAT), the benefits of NAAT could be extended to a larger population through more targeted use. In these settings, in accordance with current WHO guidelines, a computer-aided detection system with high clinical performance can increase the viability of this strategy by serving as an effective alternative to human readers. Our cost analysis of this two-stage screening workflow using the DLS suggests that it has the potential to provide 40%–80% cost savings at 1%–10% prevalence. The cost savings increase further as prevalence falls, which is an important financial consideration in disease eradication.

When evaluating each country individually, we found that the DLS consistently performed well. The DLS performance was excellent in two commonly used case-control data sets from China and the United States. Performance also generalized to external test data sets from a TB screening center in India and a mining population in South Africa. Moreover, the performance of the DLS was maintained in the enriched Zambia data set, which was more difficult because it was prefiltered by another computer-aided detection device and uncomplicated TB-negative cases were removed. Because many images that were considered radiologically clear were excluded from this data set, the difficulty of triaging the remaining subjects was likely increased. In addition, the DLS performed well even when radiologists indicated minor technical issues with the image, suggesting robustness to the image quality issues encountered in real-world practice.

Although there have been previous studies looking at artificial intelligence detection of TB in chest radiography, our study stands out in its inclusion and comprehensive subgroup evaluation of a diverse patient population to better understand the DLS performance (7,8,11,12,18,29). In addition to performing well on data sets from different countries with a wide range of race and ethnicities, we further verified that the DLS remained comparable to the radiologists in subjects with differing demographics (age, sex), in those without a history of TB, in those with sputum smear results, and with subgroups consistent with the four-symptom screen recommended by the WHO: cough, weight loss, fever, or night sweats.

Importantly, we found that our DLS performed well in groups of patients who are known to have atypical chest radiographs at baseline or atypical pulmonary manifestations of active TB. Our second test set consisted of subjects from a gold mining population in South Africa, a group identified by the WHO for systematic screening. In addition to having a high prevalence of TB, the mining population is also known to have a high prevalence of baseline pulmonary abnormalities such as silicosis, emphysema, and chronic obstructive pulmonary disease. We found that our DLS remained comparable to radiologists in this population. Another vulnerable group is the HIV-infected population, which has up to a 40-fold increased risk of active TB compared with background rates in the general population (1,30). Patients with HIV-associated pulmonary TB often have atypical presentations on chest radiographs, making them more difficult to screen (31). We found that our DLS also remained comparable with radiologists in this subset, although both the radiologists and the DLS performed worse than in the HIV-negative subgroup.

As part of our comprehensive model evaluation, we found that our DLS had high sensitivity for images with TB-associated radiologic findings but did not perform as well when no obvious TB-associated abnormality was present. This performance pattern was similar for both India-based and U.S.-based radiologists, suggesting that the DLS, like human readers, depends on the presence of radiologic features to identify TB-positive images. Finally, the DLS accurately detected non-TB abnormalities that were identified by radiologists. This capability addresses one of the drawbacks of traditional computer-aided detection systems noted by the WHO: that, unlike human readers, such systems could not simultaneously screen for other pulmonary or thoracic conditions.

Our comprehensive analysis with a large group of radiologists also revealed several important subtleties. First, irrespective of practice location, radiologists demonstrated a wide range of sensitivities and specificities. This variability is documented in the literature, with clinical experience a potential contributing factor (6,32–34). Second, radiologists practicing in India were generally more specific and less sensitive than those practicing in the United States, potentially reflecting local clinical practice patterns and the availability of testing resources (eg, whether borderline cases are referred for confirmatory testing rather than risking a missed case of a contagious disease).
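
These sensitivity-specificity differences amount to different operating points along an ROC curve. As a minimal sketch using purely hypothetical scores (not DLS outputs), the following shows one way a threshold can be chosen to favor a target sensitivity, analogous to how the DLS operating point was prespecified to favor sensitivity:

```python
import math

def threshold_for_sensitivity(pos_scores, neg_scores, target_sens=0.90):
    # Rank TB-positive scores from highest to lowest; calling the top k scores
    # "positive" detects exactly k true cases, so the smallest k reaching the
    # target sensitivity gives the highest usable threshold (best specificity).
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(target_sens * len(ranked))
    threshold = ranked[k - 1]
    sensitivity = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    specificity = sum(s < threshold for s in neg_scores) / len(neg_scores)
    return threshold, sensitivity, specificity

# Hypothetical model scores for 10 TB-positive and 10 TB-negative images.
pos = [0.97, 0.93, 0.88, 0.81, 0.76, 0.64, 0.58, 0.52, 0.46, 0.12]
neg = [0.71, 0.49, 0.38, 0.33, 0.27, 0.22, 0.18, 0.14, 0.09, 0.05]
thr, sens, spec = threshold_for_sensitivity(pos, neg, target_sens=0.90)
print(f"threshold={thr:.2f}, sensitivity={sens:.0%}, specificity={spec:.0%}")
```

Lowering the target sensitivity raises the threshold and improves specificity; a reader or system that systematically prefers one end of this trade-off will appear more specific and less sensitive (or vice versa) at a fixed accuracy.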

Our study has limitations. First, it was retrospective; prospective validation studies are being planned to better understand the challenges of integrating the DLS into real-world workflows. Second, because our data sets all had relatively high TB prevalence, the performance of the DLS in populations with lower prevalence still needs to be evaluated. Finally, the cost analysis is a simulation that makes simplifying assumptions, such as DLS performance being unaffected by changes in prevalence. In practice, changes in prevalence may be associated with changes in disease severity, which may affect DLS sensitivity or specificity.
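
To make the structure of that simulation concrete, the sketch below compares the cost per TB-positive patient detected under confirmatory NAAT for all subjects versus NAAT restricted to DLS-flagged radiographs. The 88% sensitivity and 79% specificity correspond to the DLS at its prespecified operating point; the cohort size, prevalence, and per-test cost are purely illustrative assumptions, and DLS performance is (simplistically) assumed independent of prevalence:

```python
# Illustrative triage-cost sketch; all inputs except sensitivity/specificity
# are hypothetical assumptions, not values from the study's cost analysis.

def cost_per_tb_positive_detected(n, prevalence, naat_cost,
                                  dls_sens=0.88, dls_spec=0.79):
    tb_pos = n * prevalence
    tb_neg = n - tb_pos

    # Strategy 1: confirmatory NAAT for every subject (all cases detected).
    naat_for_all = n * naat_cost / tb_pos

    # Strategy 2: NAAT only for chest radiographs flagged by the DLS.
    flagged = tb_pos * dls_sens + tb_neg * (1 - dls_spec)
    detected = tb_pos * dls_sens
    dls_triage = flagged * naat_cost / detected

    return naat_for_all, dls_triage

naat_for_all, dls_triage = cost_per_tb_positive_detected(
    n=10_000, prevalence=0.05, naat_cost=20.0)
print(f"NAAT for all: ${naat_for_all:.0f} per TB-positive patient detected")
print(f"DLS triage:   ${dls_triage:.0f} per TB-positive patient detected")
print(f"Cost reduction: {1 - dls_triage / naat_for_all:.0%}")
```

With these particular assumptions the triage strategy costs roughly 72% less per TB-positive patient detected, within the 40%-80% range reported for the simulation; varying `prevalence` shows how the relative saving shifts, and the caveat above (prevalence-dependent sensitivity and specificity) would shift it further.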

In conclusion, we developed a deep learning system (DLS) that was noninferior to radiologists for the detection of active pulmonary TB on chest radiographs and demonstrated its generalizability using international test data sets spanning five countries and a wide range of races and ethnicities: China, India, the United States, Zambia, and South Africa. The South Africa test set comprised a mining population, which the World Health Organization (WHO) considers high risk and which may also have had mining-specific pulmonary conditions. The DLS further met the WHO targets when its operating point was matched to either 90% sensitivity or 70% specificity. The DLS may facilitate tuberculosis screening in areas with limited radiologist resources and merits further prospective clinical validation.

Disclosures of conflicts of interest: S.K. Employee of Google; funding for conference attendance; stock in Alphabet. J.Y. No relevant relationships. S.J. Employee of Google; patents pending with Google; stock in Alphabet. R.P. Employee of Google; patent pending with Google; stock in Alphabet. Z.N. Patents planned, issued, or pending for Google; stock or stock options in Alphabet. C.C. Stock in Alphabet. N.B. No relevant relationships. C.L. Consulting fees from Google. S.M.M. Stock or stock options in Alphabet. T.H. Stock in Alphabet. A.P.K. Patents planned, issued, or pending for Google; stock or stock options in Alphabet. S.R.K. No relevant relationships. M.M. No relevant relationships. J.M. Head of Health and principal investigator for the study as registered with the ethics committee; received no financial compensation for this work above full-time employment. T.S. No relevant relationships. G.S.C. Support for attending meetings/travel from Google; Google may plan patents related to this work; stock or stock options in Alphabet. L.P. Employee of Google; stock in Alphabet. K.C. Leadership or fiduciary role in other board, society, committee, or advocacy group, paid or unpaid, from Google; stock or stock options in Alphabet. P.H.C.C. Employee of Google; stock in Alphabet; several patents granted or pending on machine learning models for medical images with Google. Y.L. Employee of Google; stock in Alphabet. K.E. Employee of Google. D.T. Support for attending meetings and/or travel from Alphabet; patents issued; stock in Alphabet. S.S. Employee of Google; stock in Alphabet. S.P. No relevant relationships.

Acknowledgments

The authors thank the members of the Google Health Radiology and labeling software teams for software infrastructure support, logistical support, and assistance in data labeling. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong, BA, for coordinating the imaging annotation work; Anna Majkowska, PhD, for initial modeling efforts; Joshua Reicher, MD, for early input; Christopher Semturs, MS, for team guidance; Rayman Huang, PhD, for statistical input; Thidanun Saensuksopa, MHCI, for figure and user interface design; and Akinori Mitani, MD, PhD, and Craig H. Mermel, MD, PhD, for manuscript feedback.

Author Contributions

Author contributions: Guarantors of integrity of entire study, S.K., J.Y., Z.N., J.M., G.S.C., S.S.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, S.K., J.Y., S.J., Z.N., C.C., C.L., T.H., A.P.K., J.M., T.S., L.P., Y.L., K.E., S.P.; clinical studies, J.Y., C.C., C.L., M.M., J.M., L.P., K.C., K.E., D.T., S.S.; experimental studies, S.K., J.Y., S.J., R.P., Z.N., C.L., T.H., A.P.K., J.M., L.P., P.H.C.C., K.E., D.T., S.S.; statistical analysis, S.K., J.Y., S.J., Z.N., S.M.M., T.H., S.R.K., J.M., T.S., L.P., P.H.C.C., Y.L., S.S.; and manuscript editing, S.K., J.Y., S.J., R.P., Z.N., C.C., N.B., C.L., S.M.M., A.P.K., S.R.K., M.M., J.M., T.S., G.S.C., L.P., P.H.C.C., Y.L., K.E., D.T., S.S., S.P.

* S.K. and J.Y. contributed equally to this work.

** S.S. and S.P. are co-senior authors.

Supported by Google.

Data sharing: Data analyzed during the study were provided by a third party. Requests for data should be directed to the provider indicated in the Acknowledgments.

References

  • 1. World Health Organization. Global Tuberculosis Report 2020. https://www.who.int/publications/i/item/9789240013131. Accessed December 11, 2020.
  • 2. World Health Organization. Latent tuberculosis infection: updated and consolidated guidelines for programmatic management. Geneva, Switzerland: World Health Organization, 2018.
  • 3. Roberts L. How COVID hurt the fight against other dangerous diseases. Nature 2021;592(7855):502–504.
  • 4. Stop TB Partnership. https://www.stoptb.org/securing-quality-tb-care-all/high-burden-countries-tuberculosis. Accessed January 5, 2021.
  • 5. World Health Organization. UN General Assembly adopts Declaration of the first-ever United Nations High Level Meeting on TB. https://www.who.int/news/item/11-10-2018-un-general-assembly-adopts-declaration-of-the-first-ever-united-nations-high-level-meeting-on-tb. Published 2018. Accessed May 10, 2021.
  • 6. World Health Organization. WHO consolidated guidelines on tuberculosis: Module 2: screening – systematic screening for tuberculosis disease. Geneva, Switzerland: World Health Organization, 2021.
  • 7. Khan FA, Majidulla A, Tavaziva G, et al. Chest x-ray analysis with deep learning-based software as a triage test for pulmonary tuberculosis: a prospective study of diagnostic accuracy for culture-confirmed disease. Lancet Digit Health 2020;2(11):e573–e581.
  • 8. Murphy K, Habib SS, Zaidi SMA, et al. Computer aided detection of tuberculosis on chest radiographs: An evaluation of the CAD4TB v6 system. Sci Rep 2020;10(1):5492.
  • 9. Qin ZZ, Naheyan T, Ruhwald M, et al. A new resource on artificial intelligence powered computer automated detection software products for tuberculosis programmes and implementers. Tuberculosis (Edinb) 2021;127:102049.
  • 10. Rajpurkar P, Joshi A, Pareek A, et al. CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting. arXiv preprint arXiv:2002.11379. http://arxiv.org/abs/2002.11379. Posted February 26, 2020. Accessed October 27, 2021.
  • 11. Engle E, Gabrielian A, Long A, Hurt DE, Rosenthal A. Performance of Qure.ai automatic classifiers against a large annotated database of patients with diverse forms of tuberculosis. PLoS One 2020;15(1):e0224445.
  • 12. Qin ZZ, Sander MS, Rai B, et al. Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems. Sci Rep 2019;9(1):15000.
  • 13. Rosenthal A, Gabrielian A, Engle E, et al. The TB Portals: an Open-Access, Web-Based Platform for Global Drug-Resistant-Tuberculosis Data Sharing and Analysis. J Clin Microbiol 2017;55(11):3267–3282.
  • 14. Jaeger S, Candemir S, Antani S, Wáng YXJ, Lu PX, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg 2014;4(6):475–477.
  • 15. Jaeger S, Karargyris A, Candemir S, et al. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging 2014;33(2):233–245.
  • 16. Candemir S, Jaeger S, Palaniappan K, et al. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans Med Imaging 2014;33(2):577–590.
  • 17. National Library of Medicine. LHNCBC Download - Dataset. https://lhncbc.nlm.nih.gov/LHC-downloads/dataset.html. Accessed February 3, 2022.
  • 18. Kagujje M, Chilukutu L, Somwe P, et al. Active TB case finding in a high burden setting; comparison of community and facility-based strategies in Lusaka, Zambia. PLoS One 2020;15(9):e0237931.
  • 19. Majkowska A, Mittal S, Steiner DF, et al. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Radiology 2020;294(2):421–431.
  • 20. Nabulsi Z, Sellergren A, Jamshy S, et al. Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19. Sci Rep 2021;11(1):15523.
  • 21. Obuchowski NA, Rockette HE. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Commun Stat Simul Comput 1995;24(2):285–308.
  • 22. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat Med 2007;26(3):596–619.
  • 23. Chen W, Wunderlich A, Petrick N, Gallas BD. Multireader multicase reader studies with binary agreement data: simulation, analysis, validation, and sizing. J Med Imaging (Bellingham) 2014;1(3):031011.
  • 24. Chakraborty DP. Observer performance methods for diagnostic imaging: foundations, modeling, and applications with R-based examples. Boca Raton, Fla: CRC Press, 2017.
  • 25. Chen W, Petrick NA, Sahiner B. Hypothesis testing in noninferiority and equivalence MRMC ROC studies. Acad Radiol 2012;19(9):1158–1165.
  • 26. Committee for Proprietary Medicinal Products. Points to consider on switching between superiority and non-inferiority. Br J Clin Pharmacol 2001;52(3):223–228.
  • 27. Schumacher SG, Wells WA, Nicol MP, et al. Guidance for Studies Evaluating the Accuracy of Sputum-Based Tests to Diagnose Tuberculosis. J Infect Dis 2019;220(Suppl 3):S99–S107.
  • 28. Rønby PE, Jorge C, Mahbuba K, et al. Redesigning Clinical Pathways for Immediate Diabetic Retinopathy Screening Results. NEJM Catal Innov Care Deliv 2021;2(8):CAT.21.0096.
  • 29. Qin ZZ, Ahmed S, Sarker MS, et al. Tuberculosis detection from chest x-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms. Lancet Digit Health 2021;3(9):e543–e554.
  • 30. Selwyn PA, Hartel D, Lewis VA, et al. A prospective study of the risk of tuberculosis among intravenous drug users with human immunodeficiency virus infection. N Engl J Med 1989;320(9):545–550.
  • 31. Kistan J, Laher F, Otwombe K, et al. Pulmonary TB: varying radiological presentations in individuals with HIV in Soweto, South Africa. Trans R Soc Trop Med Hyg 2017;111(3):132–136.
  • 32. Pinto LM, Pai M, Dheda K, Schwartzman K, Menzies D, Steingart KR. Scoring systems using chest radiographic features for the diagnosis of pulmonary tuberculosis in adults: a systematic review. Eur Respir J 2013;42(2):480–494.
  • 33. van't Hoog A, Viney K, Biermann O, Yang B, Leeflang MMG, Langendam MW. Symptom- and chest-radiography screening for active pulmonary tuberculosis in HIV-negative adults and adults with unknown HIV status. Cochrane Database Syst Rev. https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD010890.pub2/full. Published online March 23, 2022.
  • 34. Piccazzo R, Paparo F, Garlaschi G. Diagnostic accuracy of chest radiography for the diagnosis of tuberculosis (TB) and its role in the detection of latent TB infection: a systematic review. J Rheumatol Suppl 2014;91(0):32–40.
  • 35. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In: 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 22–29, 2017. Piscataway, NJ: IEEE, 2017;618–626.

Article History

Received: Sept 21 2021
Revision requested: Nov 15 2021
Revision received: May 26 2022
Accepted: July 20 2022
Published online: Sept 06 2022
Published in print: Jan 2023