Introduction

Breast cancer is the most common cancer in women and one of the leading causes of cancer death worldwide1. The heterogeneous nature of breast cancer makes its initial characterization a critical step in treatment planning and decision making. One aspect of breast cancer characterization that remains central to its prognostic classification is the Nottingham combined histologic grade (Elston-Ellis modification of Scarff-Bloom-Richardson grading system)2,3. First described and validated over 30 years ago4, the Nottingham Grading System (NGS) consists of three components: mitotic count (MC), nuclear pleomorphism (NP), and tubule formation (TF), and is an important component of existing prognostic tools including the AJCC prognostic stage grouping5, PREDICT online prognostic classification tool6, and the Nottingham Prognostic Index7. However, while the combined histologic grade has been repeatedly shown to be associated with clinical outcomes, the task’s inherent subjectivity can also result in inter-pathologist variability that limits the generalizability of its prognostic utility2,8. In addition, up to half of breast cancer cases are classified in routine practice as “grade 2”, an intermediate risk group with limited clinical value due to inclusion of some low and high-grade tumors3.

The application of computer vision and artificial intelligence (AI) to histopathology has seen tremendous growth in recent years and offers the potential to augment pathologist expertise and increase consistency and efficiency. Work relevant to breast cancer includes AI systems for counting mitoses9,10,11,12, scoring nuclear pleomorphism13,14, recognizing tumor subtypes15,16, detecting metastases in lymph nodes17,18, identifying biomarker status19,20,21,22,23, and predicting prognosis24,25,26. While such prior works address automated breast cancer grading, they have not specifically combined models for all three components of the Nottingham grading system, and only a small number have used prognostic evaluation to complement the validation approach. Additional differences to consider across works include the machine learning approach and the image and specimen type, such as direct microscope image capture15, tissue microarray14, or whole slide images. Understanding the performance and application of such tools in the context of actual pathological review and workflows remains a critical next step for translation to clinical utility.

Uniquely, this work represents the development and combination of deep learning models for all three components of the NGS, with evaluation against reference grades provided by expert review of multiple pathologists. To further complement evaluation against pathologist grading and to explore the use of these automated tumor grading models, we analyze the prognostic value of the AI-based tumor grades. Prognostic evaluation utilizes an external test set consisting of cases from The Cancer Genome Atlas breast invasive carcinoma (TCGA BRCA) study. This analysis demonstrates prognostic value on par with that of tumor grading provided by pathologists, providing additional validation of the AI-based Nottingham grading system (AI-NGS) and suggesting a potential approach to improve breast cancer classification and prognostication. By enabling grading that is both more objective (less inter-pathologist variability) and more fine-grained (via availability of continuous scores for each component), the AI-NGS can combine strengths of AI with existing knowledge about the prognostic value of well-established morphological features14.

Results

Cohort characteristics

All available whole-slide images (WSIs) from three data sources were reviewed by qualified pathologists for slide-level inclusion criteria and quality assurance (see Methods). This resulted in 657 cases (1502 slides) from a tertiary teaching hospital (TTH), 98 cases (98 slides) from a medical laboratory (MLAB), and 829 cases (878 slides) from TCGA. TTH and MLAB were used for model development, while TCGA was used for evaluation only. The datasets and corresponding clinical characteristics are summarized in Table 1. For the test set, 829 TCGA cases (878 slides) were used for prognostic evaluation and 662 TCGA cases (685 slides) with available reference annotations were used for evaluation of histologic grading (case inclusion and exclusion are summarized in Supplementary Fig. 1).

Table 1 Dataset characteristics for development and evaluation of grading and outcome prediction models.

Performance of deep learning systems for component features

We developed individual deep learning systems (DLS) for each component of the Nottingham grading system (MC, NP, TF). The scores generated by the DLS for each feature are continuous, and can then be discretized to produce integer scores (1, 2, or 3) for comparison to pathologist grading. We evaluated the performance of each DLS on the held-out test set of WSIs from TCGA using majority vote reference annotations provided by three pathologists (approach summarized in the schematic in Fig. 1). The individual component models were evaluated at both the "patch level" and the "slide level" (see Methods). Patch-level performance results are summarized in Table 2 with example classifications from the individual models shown in Fig. 2. For the mitosis detection model (evaluated as a detection task), the mitotic figure F1 score was 0.60 (95% CI: 0.58, 0.62). For the patch-level classification models, the quadratic-weighted kappa was 0.45 (95% CI: 0.41, 0.50) for NP and 0.70 (95% CI: 0.63, 0.75) for TF. Examples of patch-level predictions across entire WSIs are provided in Supplementary Fig. 2. For evaluation at the slide level, using the majority score provided by pathologists as the reference grade, quadratic-weighted kappa was 0.81 (95% CI: 0.78, 0.84) for MC, 0.48 (95% CI: 0.43, 0.53) for NP, and 0.75 (95% CI: 0.67, 0.81) for TF (Table 2). Additional metrics to enable comparison to other studies (including unweighted kappa and precision and recall for MC) are available in Supplementary Table 1, and benchmark comparisons for slide-level grading by both pathologists and computational approaches are provided in Supplementary Table 2.

Fig. 1: Overview of annotation, deep learning system (DLS) development, and prognostic evaluation.
figure 1

A Annotation Overview: Pathologists provided annotations at a region-level and slide-level for all components of the histologic grade, including identification of individual mitotic figures. B DLS overview: convolutional neural network models were developed for invasive carcinoma (INVCAR) as well as all three components of the histologic grading system. These patch-level models were used as input to stage 2 models to provide a component score at the slide level for each feature. C Prognostic evaluation: Component grade scores provided by the DLS or pathologists were used to fit Cox models for evaluation and comparison of prognostic value.

Table 2 Component model performance.
Fig. 2: Visualization of patch-level DLS predictions.
figure 2

A Pathologist annotations for mitoses corresponding to a single high-power field of 500 μm × 500 μm (left) and the corresponding heatmap overlay provided by the MC model (right). Red regions of the overlay indicate high likelihood of a mitotic figure according to the model. B Patches corresponding to regions for which pathologists and the model both identified mitotic figures (concordant), regions classified as mitotic figures by the model but not identified by at least 2 of 3 reviewing pathologists (false positive), or regions identified by at least 2 of 3 reviewing pathologists as containing a mitotic figure but not classified as such by the model (false negative). Discordant identification of mitotic figures by the model did not appear to be due to image or staining quality, but rather appeared to largely reflect morphologic details for which pathologist identification of mitotic figures was also variable. Patches are 32 μm × 32 μm with a 10 μm scale bar shown in the top row. C Individual patches classified as grade 1, 2, or 3 for nuclear pleomorphism (40× magnification; 256 μm × 256 μm; 100 μm scale bar for reference). D Individual patches classified as grade 1, 2, or 3 for tubule formation (10× magnification; 1 mm × 1 mm; 400 μm scale bar for reference). MC mitotic count, NP nuclear pleomorphism, TF tubule formation.

One challenge in the performance evaluation of deep learning models for histologic grading is that pathologist-provided grading itself can be subject to high inter-rater variability27. Given the availability of three pathologist reviews for each slide in our study (see Annotations section of Methods), we grouped slides by the combination of pathologist scores for each slide and evaluated the DLS output for the resulting groups (Fig. 3). This analysis demonstrates that the continuous nature of the DLS output can reflect the distribution of pathologist agreement, whereby the output of the deep learning models produces “intermediate scores” for cases lacking unanimous pathologist agreement. For example, a case with a majority vote score of “1” for nuclear pleomorphism may have unanimous agreement across all three pathologists, or may have one pathologist giving a higher score, and the models were found to reflect these differences. As seen in Fig. 3, as fewer pathologists indicated a score of “1” and more pathologists indicated a score of “2” or “3”, the DLS-estimated probability for a score of “1” (in green) decreased, and the estimated probability for a score of “3” (in red) increased.
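As a minimal sketch of this grouping analysis (column names, values, and data structures below are illustrative placeholders, not our actual pipeline), the per-group mean of the DLS class probabilities can be computed as follows:

```python
import pandas as pd

# Simulated stand-in: one row per slide with the triplicate pathologist
# scores and the mean DLS probability for each possible component score.
df = pd.DataFrame({
    "path_scores": [(1, 1, 1), (1, 1, 2), (1, 2, 2), (2, 2, 2)],
    "p_score1": [0.80, 0.60, 0.40, 0.20],
    "p_score2": [0.15, 0.30, 0.45, 0.50],
    "p_score3": [0.05, 0.10, 0.15, 0.30],
})

# Group slides by the order-insensitive combination of pathologist scores
# and average the DLS output per group (as plotted in Fig. 3).
df["combo"] = df["path_scores"].map(lambda s: tuple(sorted(s)))
group_means = df.groupby("combo")[["p_score1", "p_score2", "p_score3"]].mean()
```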

Fig. 3: Assessing slide-level classification of nuclear pleomorphism and tubule formation.
figure 3

The three pathologist scores provided for nuclear pleomorphism (A) and tubule formation (B) for each slide are represented by the pie charts. Bar plots represent the model output for each possible component score, with the mean output plotted for the cases matched to those represented by the corresponding pie chart. Green corresponds to a component score of 1, yellow to a component score of 2, and red to a component score of 3. Error bars are 95% CI. DLS deep learning system.

Next, to further evaluate DLS performance in the context of known inter-rater variability, we calculated both inter-pathologist and DLS-pathologist agreement. The average kappa (quadratic-weighted) for inter-pathologist agreement was 0.56, 0.36, and 0.55 for MC, NP, and TF, respectively, versus 0.64, 0.38, and 0.68 for DLS-pathologist agreement (Fig. 4). The kappa for inter-pathologist agreement for each individual pathologist (one vs. rest) as well as for DLS-pathologist agreement are summarized in Fig. 4, demonstrating that on average the DLS provides consistent, pathologist-level agreement on grading of all three component features. The full confusion matrices for inter-pathologist agreement and for DLS agreement with the majority vote scores (patch-level and slide-level) are available in Supplementary Figs. 3 and 4.

Fig. 4: Inter-pathologist and DLS-pathologist concordance for slide-level component scoring.
figure 4

Analyses for mitotic count (A), nuclear pleomorphism (B), and tubule formation (C) are shown in the individual panels. Each blue bar represents the agreement (quadratic-weighted kappa) between a single pathologist and the other available pathologist scores on the same cases. The yellow bar represents the agreement of the DLS-provided component score with all available pathologists' scores on the matched set of cases. Error bars represent 95% confidence intervals computed via bootstrap. Average values in the legend represent the mean quadratic-weighted kappa across all blue bars and across all yellow bars, respectively. DLS deep learning system, Paths pathologists.

Prognostic value of AI-NGS

We further analyzed the association of both AI-NGS and pathologist-provided grades with clinical outcomes, using the external test dataset (TCGA-BRCA) and progression-free interval (PFI) as the prognostic endpoint28. We conducted a non-inferiority analysis comparing AI-NGS to histologic grading provided by pathologists. Based on tune set results, the planned, primary comparison for this analysis used the sum of the continuous component scores generated by AI-NGS (AI-NGS continuous sum; float value 3–9) compared to the summed discrete score provided by pathologist review (pathologist discrete sum; integer value 3–9; see the PFI Analysis section of the Methods for additional details on continuous versus discrete scoring). The prognostic performance of the two approaches was similar, with a c-index of 0.58 (95% CI: 0.52, 0.64) using the AI-NGS continuous sum and 0.58 (95% CI: 0.51, 0.63) using the pathologist discrete sum (Table 3; delta = 0.004, lower bound of one-sided 95% CI: −0.036). This is consistent with non-inferiority of AI-NGS relative to pathologist grading (see Statistical Analysis section of Methods for additional details).

Table 3 Prognostic performance of direct risk prediction using histologic scoring provided by DLS and pathologists.

While the continuous scores of the AI-NGS were utilized for the primary analysis based on superior performance of this approach on the tune set (Supplementary Table 3) as well as prior work in prostate cancer29, additional approaches were also evaluated, including use of discrete summed scores (values 3–9) and the combined histologic grade (grade 1–3 based on the summed score4). For pathologist grading, majority vote and originally reported diagnostic grading were also evaluated. Performance for these various scoring configurations is summarized in Supplementary Table 4. The c-indices for the AI-NGS approaches were similar: 0.58 (95% CI: 0.52, 0.64) for the AI-NGS continuous sum, 0.59 (95% CI: 0.53, 0.64) for the AI-NGS discrete sum, and 0.60 (95% CI: 0.55, 0.65) for the combined histologic grade. The pathologist-based approaches were also similar, ranging from 0.58 (95% CI: 0.51, 0.63) for pathologist combined histologic grades (1–3) to 0.61 (95% CI: 0.54, 0.66) for the majority vote summed score (3–9). The association of each individual grading component with prognosis was also evaluated independently (Supplementary Table 5). The highest prognostic value for deep learning-based grading of a single feature on the test set was achieved for mitotic count, with a c-index of 0.58 (95% CI: 0.53, 0.64). The pathologist-provided mitotic count score gave a c-index of 0.54 (95% CI: 0.48, 0.59).

We also evaluated the prognostic value of AI-NGS in the context of established baseline variables (ER status, tumor size, nodal involvement, and age). Overall, adding AI-NGS provided improved prognostic value over the baseline variables alone (p = 0.036; likelihood ratio test of full model versus baseline model; Table 4). To better understand the potential contribution of each component feature, we performed a similar analysis using the score for each feature independently. Analysis for each component feature individually suggested improved prognostic value specifically for the AI-based mitotic count score (p = 0.041; likelihood ratio test) but not for the other features (Supplementary Table 6). Additionally, in univariable hazard ratio (HR) analysis, the MC score was the only AI-NGS feature with a p value less than 0.05 (HR = 1.30, p = 0.015; univariable HR analysis) (Table 5). In multivariable analysis adjusting for ER status, tumor size (T-category), nodal involvement, and age, this corresponded to an HR of 1.29 (p = 0.061; multivariable HR analysis) (Table 5).

Table 4 Prognostic performance using summation of histologic components in combination with baseline clinical and pathologic features.
Table 5 Cox regression on the test set using pathologist grading or AI-NGS scores and baseline variables.

Mitotic count and Ki-67 expression

Given the potential association between MC, Ki-67, and prognosis, and the increasing interest of the clinical community in the use of Ki-67 in breast cancer30,31, we also evaluated the correlation between the MC score and Ki-67 (MKI67) gene expression in our study (Fig. 5). The MC score provided by the DLS correlated with MKI67 expression with a correlation coefficient of 0.47 (95% CI: 0.41, 0.52) across the 827 TCGA cases with available gene expression data. For pathologist-provided MC scores over the same cases, the correlation coefficient was 0.37 (95% CI: 0.32, 0.43). This indicates an increased correlation with Ki-67 for the DLS-provided MC score as compared to the MC score provided by pathologist review (p = 0.002 in exploratory analysis; permutation test).

Fig. 5: Correlation of mitotic count and Ki-67 expression.
figure 5

Correlation of mitotic count and Ki-67 gene expression is shown for the mitotic count provided by the DLS (A) and by pathologist review (B). Values in brackets below the plots represent the 95% confidence interval for the correlation. Box edges and center lines indicate quartiles (25th, 50th, and 75th percentiles) of Ki-67 gene expression (TPM values via TCGA) for each MC score, with outliers (values beyond 1.5 times the interquartile range) plotted as black circles. The correlation values for DLS MC scoring with Ki-67 (0.47) and pathologist MC scoring with Ki-67 (0.37) were significantly different (p = 0.002; permutation test with n = 1000 samples). DLS deep learning system, MC mitotic count.

Discussion

In this study we developed deep learning models for all three components of the Nottingham histologic grading system, to perform both patch-level and slide-level prediction of these histologic features. A key feature of this work is that we use survival analyses to further evaluate our AI-NGS models using the more objective endpoint represented by clinical outcome. The performance of each component model exceeds most published benchmarks, and the model's prediction of clinical outcome is shown to be on par with that of pathologist-based grading. Simultaneous development of all three models enables a consistent, end-to-end DLS for Nottingham histologic grading that can also provide transparency into the underlying component features. While prior work has shown promising results for individual features9,11,12,13 or direct prediction of the final combined histologic grade26, this work is unique in combining multiple patch-level component models for the Nottingham grading system. This also enables analysis of prognostic models that can take advantage of the scores for individual features.

One important challenge to accurate and useful histologic grading is the inherent subjectivity and associated inter-pathologist variability. As such, an automated DLS for histologic grading can provide internal consistency and reliability for grading any given tumor. Such models thus have the potential to be iteratively tuned and updated with expert pathologist oversight to correct error modes and stay consistent with evolving guidelines. Additionally, this study found that DLS-pathologist agreement generally avoids the high discordance that is sometimes observed between individual pathologists while overall trends for agreement across the three features were consistent with prior reports (Supplementary Table 2). Consistent, automated tools for histologic grading may help reduce discordant interpretations and mitigate the resulting complications that impact clinical care and research studies evaluating interventions, other diagnostics, and the grading systems themselves.

The continuous, consistent, and precise component scores provided by this approach also enable exploration of the individual components contributing most to the prognostic value. In our analysis, the AI-NGS provided significantly increased prognostic value relative to the baseline variables alone. Of the individual component features, MC demonstrated the strongest independent association with PFI in our analysis, a finding consistent with prior studies32,33. Building on this, the finding that mitotic count estimation by the DLS correlates with Ki-67 gene expression has implications for ongoing research regarding integration of Ki-67 in prognostic models in breast cancer34,35. Also, the stronger correlation of Ki-67 with DLS mitotic count than with pathologist mitotic count suggests that for discordant mitotic figure classifications between the DLS and pathologists (such as those in Fig. 2B), the DLS might in fact be providing a more accurate representation of the biological ground truth (i.e., cell proliferation) than the pathologist-provided reference annotations. Additional studies using immunohistochemistry-based ground truth for training and evaluation36, as well as future comparisons between automated mitotic count and automated Ki-67 immunohistochemistry quantitation, may provide useful insights into these approaches. Overall, AI-NGS can be applied in future studies to large, multi-institutional datasets, minimizing complications of inter-pathologist variability and without requiring additional pathologist case review. This may in turn help refine existing regression models such as the Nottingham Prognostic Index37 or Magee equations38, by enabling further optimization of weights and providing consistent, precise, and automated scoring at scale.

Interestingly, in our test set, the summed continuous DLS score (floating point values in [3,9]) was not more prognostic than using a discrete, less granular, combined histologic grade (grade 1, 2, or 3). This is despite the continuous score being slightly superior on the smaller TTH “tune” data split, and is in contrast to our related work in prostate cancer where continuous, DLS-based Gleason scoring was superior to discrete grade group categories for outcome prediction29. This may be due in part to the relatively large confidence intervals associated with the small rate of events as well as domain shifts between development and test sets due to inter-institutional differences or variability in slide processing and quality, especially given the diversity of tissue source sites in TCGA. Additionally, most TCGA cases only contributed a single slide, which may not always be most representative of the tumor and associated histologic features.

Limitations of this study include the following: While the training slides used in this study represent two institutions and the test set comprises multiple sites contributing to TCGA, further evaluation of generalizability to diverse cohorts and across individual tumor subtypes is warranted. Additionally, although TCGA has useful attributes as a test set for this study (e.g., diversity of pre-analytical variables and tissue sites), the follow-up time is limited, with a median of ~2 years. As much of the evidence for the prognostic significance of histologic grading is in the setting of longer follow-up time to provide more complete recurrence event information2, the relatively shorter follow-up time available for TCGA data may result in less precise estimates of prognostic value for both pathologist and AI-based grading. This is likely to have a predominant impact on analysis of ER+ HER2− cases, for which progression events often happen later yet for which grading is expected to provide the best risk stratification. As such, the limited follow-up time may partially explain the relatively modest c-indices observed when considering the grading in isolation for this cohort. An additional limitation is that we were not able to control for possible confounding due to treatment differences, as this information was not available for most cases. Future work utilizing datasets with longer clinical follow-up, treatment data, and larger cohorts that enable analysis of individual tumor subtypes may allow improved prognostic evaluation and further demonstrate the method's clinical significance. Larger, diverse datasets may also enable model development that directly predicts progression-free survival from the tissue images, without using histologic grade as an intermediate prediction to estimate clinical outcomes. Such "weakly supervised" approaches could allow identification of new associations between survival and morphological features, potentially leading to iterative refinement of existing grading systems. Lastly, given the subjective nature of the pathologist grading used as the reference standard for model evaluation, adjudication sessions to achieve consensus scoring and mitotic figure identification may improve the quality of labels for training and evaluation; this hypothesis could be tested in future work.

This study demonstrates the potential for deep learning approaches to provide comprehensive grading in breast cancer that is on par with pathologist review. The consistent and precise nature of these models allows for potentially improved integration into prognostic models as well as enabling opportunities to efficiently evaluate correlations between morphological and molecular features. Future work that combines AI-based grading of established histologic features with additional machine-learned features to generate improved prognostic models remains a compelling next step.

Methods

Data

This retrospective study utilized de-identified data from three sources: a TTH, a MLAB, and TCGA. Histopathology, clinical data, and Ki-67 expression data for TCGA were accessed via https://portal.gdc.cancer.gov. WSIs from TTH include original, archived hematoxylin and eosin (H&E)-stained slides and freshly cut and stained sections from archival blocks. WSIs from MLAB represent freshly cut and H&E-stained sections from archival blocks. All WSIs used in this study were scanned at 0.25 μm/pixel (40×). The small number of TCGA BRCA images scanned at 0.50 μm/pixel (20×) were excluded in order to ensure availability of 40× magnification for DLS-based MC and NP grading.

All primary invasive breast carcinoma cases with available slides or blocks were reviewed for inclusion. For TTH this includes all available cases from 2005 to 2016, for MLAB this includes cases from 2002 to 2011, and the TCGA data comprise the TCGA-BRCA study with cases from 1988 to 2013. The study was approved by the Advarra institutional review board (Columbia, Maryland) and deemed exempt from informed consent as all data were retrospective and de-identified.

Whole slide image inclusion criteria

All available WSIs were reviewed by pathologists for slide-level inclusion criteria and quality assurance. Only slides containing H&E stained primary invasive breast carcinoma from formalin-fixed paraffin-embedded resection specimens were included in this study. Examples of excluded images include lymph node specimens, needle core biopsies, frozen tissue, immunohistochemistry slides, and slides containing carcinoma in-situ only.

This resulted in 1502 slides (657 cases) from TTH, 98 slides (98 cases) from MLAB, and 878 slides (829 cases) from TCGA. The slides from TTH were used for training and tuning of the models, slides from MLAB were used for tuning only, and TCGA slides represent a held-out external test set used only for evaluation of all models. For evaluating the individual DLS for each component feature, slides without a pathologist majority were excluded to ensure reliability of the reference for this performance evaluation, resulting in 685 slides (662 cases). See Table 1 and Supplementary Fig. 1 for details regarding dataset usage and characteristics.

Annotations

Pathologist annotations were performed for segmentation of invasive carcinoma as well as for all three components of the Nottingham histologic grading system (MC, NP, and TF). Annotations for the grading components were collected as slide-level labels as well as region-level labels for specific regions of tumor (three 1 mm × 1 mm regions per slide selected by pathologists to capture representative tumor). For both slide-level and region-level annotation tasks, 3 board-certified pathologists from a cohort of 10 pathologists were randomly assigned per slide, resulting in triplicate annotations per region of interest and per slide. Additional detail for each annotation task is provided below.

Invasive carcinoma segmentation

All regions considered to represent invasive carcinoma were annotated, with guidance to provide coarse annotations capturing regions of at least 70% tumor purity (to account for non-tumor cells present in a given region). Regions considered to be carcinoma in situ were also annotated and used to train the invasive carcinoma model. Slides determined on initial review to contain only carcinoma in situ, microinvasive tumor, or lymphovascular invasion were excluded prior to invasive carcinoma annotation.

Mitotic figures

For each slide, the initial annotating pathologist selected three separate 1 mm × 1 mm regions, enriched for mitotic figures where present. Within each selected region, pathologists were asked to exhaustively annotate all mitotic figures. Each selected region was annotated by three pathologists, and only mitoses with agreement between at least two pathologists were used as "positive" events for training and evaluation.

Nuclear pleomorphism and tubule formation

For nuclear pleomorphism and tubule formation grading, we utilized the region boxes already selected during the mitotic figure annotation. Pathologists were asked to provide a component grade score for each selected region according to the Nottingham grading scale as though each region were independently representative of the whole tumor (acknowledging that such grading in clinical practice involves more holistic interpretation of tumor regions). Again, each region was annotated by three pathologists, and the final label for each region was based on the majority vote score.

Slide-level grading

Pathologists were asked to assess the whole slide for each component of the Nottingham grading scale, providing a score of 1–3 for each component for each slide. Each slide was reviewed by three pathologists, and the majority vote was used to determine the slide-level label for training and evaluation. Additionally, as a separate source of grading, available pathology reports were reviewed and component grade scores from the original reports were recorded (referred to as "historic pathologist scores").
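As a minimal sketch of this majority-vote rule (the function and type signatures are illustrative, not our actual implementation; slides without a majority were excluded from reference-based evaluation, as described above):

```python
from collections import Counter
from typing import Optional, Tuple

def majority_vote(scores: Tuple[int, int, int]) -> Optional[int]:
    """Return the majority score among three pathologist scores, or None."""
    value, count = Counter(scores).most_common(1)[0]
    return value if count >= 2 else None

majority_vote((2, 2, 3))  # -> 2
majority_vote((1, 2, 3))  # -> None (no majority; slide excluded)
```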

Deep learning system development

We developed 4 separate deep learning models: one to segment invasive carcinoma within a WSI, and three "grading models" to predict the slide-level component score for each of the three tumor features comprising the Nottingham combined histologic grade: MC, NP, and TF. The invasive carcinoma model was used to provide tumor masks for the Nottingham grading models.

The invasive carcinoma model was trained using pathologist annotations to distinguish between 3 classes: non-tumor, carcinoma in situ, and invasive carcinoma. The resulting model was evaluated on the tune set, achieving an AUC of 0.95 for invasive carcinoma vs. the other 2 classes (carcinoma in situ or non-tumor).

To segment invasive carcinoma from the remaining tissue when applying the DLS for histologic grading, we applied the argmax function to each patch (1024 × 1024 pixels) in the output likelihood map over the entire slide (with argmax referring to the class with the highest prediction score for each patch). The patches for which the model estimated invasive carcinoma as the most likely class were then selected as the invasive carcinoma segmentation for downstream tasks (e.g., prediction of slide-level MC, NP, and TF scores).
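As a minimal sketch of this selection step (assuming the stage 1 output is stored as a per-patch likelihood array with one channel per class; the array layout and class ordering here are illustrative assumptions):

```python
import numpy as np

INVASIVE = 2  # assumed class index order: (non-tumor, in situ, invasive)

def invasive_carcinoma_mask(likelihoods: np.ndarray) -> np.ndarray:
    """Boolean mask of patches whose argmax class is invasive carcinoma.

    `likelihoods` is assumed to have shape (n_rows, n_cols, 3), one
    likelihood per class for each 1024 x 1024-pixel patch position.
    """
    return likelihoods.argmax(axis=-1) == INVASIVE

# Example with a simulated 4 x 4 grid of patch likelihoods.
rng = np.random.default_rng(0)
likelihoods = rng.dirichlet([1.0, 1.0, 1.0], size=(4, 4))
mask = invasive_carcinoma_mask(likelihoods)  # patches used for grading
```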

For providing slide-level component scores, each model is used as part of a DLS that consists of two stages. The first stage ("patch-level") tiles the invasive carcinoma mask regions of the WSI into individual patches for input into a convolutional neural network, providing as output a continuous likelihood score (0–1) that each patch belongs to a given class. For mitotic figure detection, this score represents the likelihood that the patch contains a mitotic figure. For NP and TF, the model output is a likelihood score for each of the three possible grade scores of 1–3. All stage 1 models were trained using the data summarized in Table 1 and Supplementary Table 7 and utilized ResNet50x1 pre-trained on a large natural image set ("JFT")39. Stain normalization and color perturbation18 were applied, and the models were trained until convergence. Hyperparameter configurations including patch size and magnification were selected independently for each component model using a combination of Vizier40 and sample grid search. Hyperparameters and optimal configurations for each stage 1 model are summarized in Supplementary Table 8.
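To illustrate the color perturbation step, a simplified per-channel jitter is sketched below; the perturbation form and magnitudes are placeholder assumptions rather than the augmentation settings of ref. 18 used in the study:

```python
import numpy as np

def color_perturb(patch, rng, max_scale=0.1, max_shift=0.1):
    """Simplified per-channel color jitter for an RGB patch in [0, 1].

    The 10% scale/shift magnitudes are placeholder values, not the
    augmentation settings used in the study.
    """
    scale = 1.0 + rng.uniform(-max_scale, max_scale, size=3)
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return np.clip(patch * scale + shift, 0.0, 1.0)

rng = np.random.default_rng(42)
patch = rng.random((256, 256, 3))   # stand-in for an RGB training patch
augmented = color_perturb(patch, rng)
```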

The second stage of each DLS assigns a slide-level feature score (1–3) for each feature (MC, NP, and TF). This is done by using the stage 1 model output to train a lightweight classifier for slide-level classification. For MC, the stage 1 output is used to calculate "mitotic density" values over the invasive carcinoma region, and the mitotic density values corresponding to the 5th, 25th, 50th, 75th, and 95th percentiles for each slide are used as the input features for the stage 2 model. Details regarding the conversion of patch-level stage 1 output to mitotic figure detection and mitotic density are provided in the following section. For NP and TF, the stage 2 input feature set is the mean patch-level output (mean softmax value) for each possible score (1, 2, or 3) across the invasive carcinoma region. Based on tune set results, logistic regression was selected for the stage 2 classifier for MC. For NP and TF, performance of different stage 2 approaches was comparable, including logistic regression, ridge regression, and random forest. Ridge regression was selected due to its simplicity and the ease of generating continuous component scores with this approach. All classifiers were regularized, with regularization strengths chosen via fivefold cross-validation on the training set. For NP, additional experiments with a hand-engineered nuclear segmentation-based approach were also conducted13,14. This approach did not improve performance in our experiments, potentially due to the wide variability in staining and cellular appearance in high-grade cases.
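A hedged sketch of the stage 2 feature construction and classifiers using scikit-learn follows; the data, shapes, and simulated inputs are illustrative stand-ins, while the feature definitions (mitotic-density percentiles for MC; mean softmax per class for NP and TF) and model families follow the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, RidgeCV

rng = np.random.default_rng(0)
n_slides = 60

# MC stage 2 features: mitotic-density percentiles per slide.
def mc_features(densities):
    """5th/25th/50th/75th/95th percentiles of mitotic density."""
    return np.percentile(densities, [5, 25, 50, 75, 95])

# NP/TF stage 2 features: mean patch-level softmax per class (scores 1-3).
def np_tf_features(patch_softmax):
    return patch_softmax.mean(axis=0)

# Simulated stand-in data: per-slide density samples and patch softmaxes.
X_mc = np.stack([mc_features(rng.gamma(2.0, size=200)) for _ in range(n_slides)])
X_np = np.stack([np_tf_features(rng.dirichlet([1.0, 1.0, 1.0], size=150))
                 for _ in range(n_slides)])
y = rng.integers(1, 4, size=n_slides)  # discrete component scores 1-3

# Regularization strength chosen via fivefold cross-validation, as in the text.
mc_model = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_mc, y)
np_model = RidgeCV(cv=5).fit(X_np, y)

continuous_np_scores = np_model.predict(X_np)  # continuous component scores
```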

Mitotic figure detection and counting from patch-level stage-1 output

For the MC model, the output (likelihood score for a mitotic figure) across all patches was treated as a heatmap used for downstream analysis. We thresholded this output probability to obtain a positive detection map, with the detection threshold set to 0.915 based on the tuning set. Because the mitosis annotations were provided by pathologists using 16 μm × 16 μm bounding boxes (about the size of one cell), we expected the model's detections to be about the same size. To obtain the detection locations, we applied morphological erosion with a square structuring element of size 16 μm × 16 μm. Applied to two overlapping 16 μm × 16 μm regions, this operation results in two disconnected points, allowing two nearby mitoses to be distinguished. We then performed connected-component analysis on the eroded map and took the centroid of each connected component as the location of a mitosis. This list of mitosis locations allowed counting of mitotic figures and the use of a sliding-window approach to calculate the mitotic density. Mitotic density was calculated for all tiles across the entire invasive carcinoma mask (1.8 mm × 1.8 mm tiles with a 50% overlapping stride). For evaluation, a predicted mitotic figure was considered a true positive if there was a reference standard mitotic figure within 16 μm.
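A minimal sketch of this post-processing using scipy is shown below; at 0.25 μm/pixel, the 16 μm × 16 μm structuring element corresponds to 64 × 64 pixels, and the synthetic heatmap is an illustrative stand-in:

```python
import numpy as np
from scipy import ndimage

THRESHOLD = 0.915   # tuned detection threshold from the text
BOX_PX = 64         # 16 um x 16 um at 0.25 um/pixel

def detect_mitoses(heatmap):
    """Return an array of (row, col) centroids of detected mitotic figures."""
    detections = heatmap > THRESHOLD
    # Erosion with a cell-sized square shrinks each box-like detection toward
    # a point, so two overlapping detections become disconnected components.
    eroded = ndimage.binary_erosion(
        detections, structure=np.ones((BOX_PX, BOX_PX), dtype=bool))
    labeled, n = ndimage.label(eroded)
    if n == 0:
        return np.empty((0, 2))
    return np.array(ndimage.center_of_mass(eroded, labeled, range(1, n + 1)))

# Synthetic example: two overlapping box-like detections -> two centroids.
heatmap = np.zeros((300, 300))
heatmap[50:130, 50:130] = 1.0
heatmap[100:180, 100:180] = 1.0
print(detect_mitoses(heatmap))
```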

Deep learning system evaluation

We evaluated DLS performance for histologic grading at both the patch level and the slide level using the TCGA test set and the annotations described above. Patch-level evaluation corresponds to the stage 1 model output and utilizes the annotated regions of interest as the reference standard (three 1 mm × 1 mm regions per slide, each annotated by three pathologists; see Annotations section of Methods above and Supplementary Methods). For MC, the patch-level reference standard corresponds to cell-sized regions identified by at least two of three pathologists as a mitotic figure. All other cell-sized regions not meeting these criteria were considered negative for the purposes of MC evaluation. For NP and TF, the majority vote annotation for each region of interest was assigned to all patches within that region and used as the reference standard (consistent with the approach for stage 1 training labels). For each patch, the class with the maximum probability in the model output probability map is selected as the final per-patch prediction. For slide-level evaluation, the majority vote for each slide-level component score was used.

For patch-level evaluation, the F1 score was calculated for mitosis detection and quadratic-weighted kappa was calculated for NP and TF. For slide-level evaluation, quadratic-weighted kappa was used for all components, including inter-pathologist agreement. Quadratic-weighted kappa was chosen because it informatively penalizes predictions at increased distance from the reference standard. Because the weighting schemes used in the literature vary, additional kappa weighting schemes were also assessed and reported in Supplementary Table 1.
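For reference, quadratic-weighted (and unweighted) kappa can be computed with scikit-learn's cohen_kappa_score; the component scores below are illustrative only:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative component scores (1-3) for a handful of slides.
pathologist = [1, 2, 2, 3, 1, 2, 3, 3, 2, 1]
dls_scores  = [1, 2, 3, 3, 1, 2, 2, 3, 2, 2]

kappa_quadratic = cohen_kappa_score(pathologist, dls_scores, weights="quadratic")
kappa_linear = cohen_kappa_score(pathologist, dls_scores, weights="linear")
kappa_unweighted = cohen_kappa_score(pathologist, dls_scores)  # weights=None
```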

Progression-free interval analysis

To further evaluate the histologic grading models, we also analyzed the prognostic value of DLS-based grading for predicting progression-free interval (PFI). PFI was chosen as a clinically meaningful endpoint suitable for TCGA-BRCA specifically as described previously28. Of note, 18 cases with PFI events in these data correspond to "new primary tumors", predominantly of non-breast origin according to TCGA annotations. As such, and because disease-specific progression events following a new primary tumor could not be reliably identified from the available data, these cases were censored at the disease-free interval time in our analysis, resulting in the 829 cases and 93 events summarized in Table 1.

In planned analysis, we evaluated the prognostic value of the DLS and pathologist-provided scores both in isolation and in the context of available clinicopathologic features. As a pre-specified non-inferiority test (based on development set results), we compared the prognostic value of the sum of continuous DLS-based component scores versus the sum of discrete component scores provided by pathologists for the same images. Here, continuous scores refer to the model output corresponding to any fractional value between 1 and 3 for each component, and discrete scores refer to the traditional integer scores of 1, 2, or 3. Combined histologic grade based on the summed scores (3–5: grade 1; 6–7: grade 2; 8–9: grade 3) was also evaluated in additional analyses. To conduct analyses adjusting for available baseline clinicopathologic variables (TNM, ER status, age) and calculate hazard ratios, we fit multivariable Cox regression models on the test set (TCGA-BRCA) using either the DLS-based component scores or the pathologist-provided component scores. To evaluate for improved prognostic value when adding AI-NGS information to the baseline variables, we performed likelihood-ratio tests for Cox models fit on baseline clinicopathologic variables alone versus models fit on baseline variables combined with grading scores. The corresponding c-indices for the risk scores provided by these models on the test set were also calculated (as reported in Table 4).
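A minimal sketch of the likelihood-ratio test for nested Cox models using lifelines is shown below; the column names and simulated data are placeholders, not the TCGA variables:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from scipy import stats

# Simulated placeholder data standing in for the test set variables.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "time": rng.exponential(24, size=n),          # months to event/censoring
    "event": rng.integers(0, 2, size=n),          # 1 = progression event
    "age": rng.normal(58, 12, size=n),
    "er_positive": rng.integers(0, 2, size=n),
    "node_positive": rng.integers(0, 2, size=n),
    "grading_score": rng.uniform(3, 9, size=n),   # e.g., AI-NGS continuous sum
})

baseline_cols = ["age", "er_positive", "node_positive"]
base = CoxPHFitter().fit(df[baseline_cols + ["time", "event"]], "time", "event")
full = CoxPHFitter().fit(df[baseline_cols + ["grading_score", "time", "event"]],
                         "time", "event")

# Likelihood-ratio statistic: 2 * (logL_full - logL_baseline); one added
# covariate here, so compare against a chi-square with 1 degree of freedom.
lr_stat = 2 * (full.log_likelihood_ - base.log_likelihood_)
p_value = stats.chi2.sf(lr_stat, df=1)
```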

In order to evaluate multivariable models for prognosis, we use leave-one-out cross-validation on the TCGA test set, fitting a Cox model per fold and calculating the mean c-index across folds. Because the absolute value of the risk score depends on many fitted parameters, such as the baseline hazard, which can vary from fold to fold, risk scores may not be directly comparable across folds. To remedy this, each output risk score is normalized to a percentile with respect to the training data for that fold.
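One possible realization of this procedure with lifelines is sketched below (simulated placeholder data); here the percentile-normalized out-of-fold scores are pooled into a single c-index computation:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
from scipy import stats

# Simulated placeholder data; in the study each row would be a TCGA case.
rng = np.random.default_rng(1)
n = 80
df = pd.DataFrame({
    "time": rng.exponential(24, size=n),
    "event": rng.integers(0, 2, size=n),
    "grading_score": rng.uniform(3, 9, size=n),
    "age": rng.normal(58, 12, size=n),
})

oof_score = np.empty(n)
for i in range(n):
    train = df.drop(index=i)
    cph = CoxPHFitter().fit(train, "time", "event")
    train_risk = cph.predict_partial_hazard(train).to_numpy()
    test_risk = float(cph.predict_partial_hazard(df.iloc[[i]]).iloc[0])
    # Percentile rank within the training risks makes scores comparable
    # across folds despite fold-to-fold differences in the fitted model.
    oof_score[i] = stats.percentileofscore(train_risk, test_risk)

# Single c-index over the pooled out-of-fold scores (negated because a
# higher risk percentile should mean a shorter progression-free interval).
c_index = concordance_index(df["time"], -oof_score, df["event"])
```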

Statistical analysis

Confidence intervals were generated via bootstrap resampling with 1000 samples. For patch-level and region-level evaluation of the DLS, we performed bootstrap resampling over slides, and for progression-free interval analysis, we performed bootstrap resampling over cases. All statistical tests are two-sided with the exception of the non-inferiority test, which is one-sided (with a pre-specified non-inferiority margin of 0.075 and alpha of 0.05). The margin was selected based on projected confidence intervals and power calculations using the tune dataset. No adjustment for multiple comparisons was implemented. For the correlation analysis of mitotic count and Ki-67, permutation testing between the DLS and pathologist MC scores was performed with 1000 samples. C-indices were computed using the lifelines.utils.concordance_index function in the Python Lifelines package (v0.26.0) and additional analyses were performed using Python (v3.7.10), Numpy (v1.19.5), and scikit-learn (v0.24.1) libraries.
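Minimal sketches of the two resampling procedures (percentile bootstrap and a paired permutation test) are shown below with placeholder data; the actual analyses resample slides or cases as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(values, stat, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    samples = [stat(rng.choice(values, size=len(values), replace=True))
               for _ in range(n_boot)]
    return np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def permutation_test(x_a, x_b, y, stat, n_perm=1000):
    """Two-sided paired permutation test for a difference in stat(x, y)
    between score sets x_a and x_b (e.g., DLS vs. pathologist MC)."""
    observed = abs(stat(x_a, y) - stat(x_b, y))
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(y)) < 0.5      # randomly swap paired scores
        pa = np.where(swap, x_b, x_a)
        pb = np.where(swap, x_a, x_b)
        count += abs(stat(pa, y) - stat(pb, y)) >= observed
    return (count + 1) / (n_perm + 1)

# Example with simulated scores and outcome values.
x_a, x_b = rng.random(100), rng.random(100)
y = x_a + rng.normal(0, 0.5, size=100)
corr = lambda x, v: np.corrcoef(x, v)[0, 1]
p_value = permutation_test(x_a, x_b, y, corr)
ci = bootstrap_ci(x_a, np.mean)              # CI for the mean, as a demo
```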