Deep learning-based predictive biomarker of pathological complete response to neoadjuvant chemotherapy from histological images in breast cancer

Background Pathological complete response (pCR) is considered a surrogate endpoint for favorable survival in breast cancer patients treated with neoadjuvant chemotherapy (NAC). Predictive biomarkers of treatment response are crucial for guiding treatment decisions. With the hypothesis that histological information on tumor biopsy images could predict NAC response in breast cancer, we proposed a novel deep learning (DL)-based biomarker that predicts pCR from images of hematoxylin and eosin (H&E)-stained tissue and evaluated its predictive performance. Methods In total, 540 breast cancer patients receiving standard NAC were enrolled. Based on H&E-stained images, DL methods were employed to automatically identify tumor epithelium and predict pCR by scoring the identified tumor epithelium to produce a histopathological biomarker, the pCR-score. The predictive performance of the pCR-score was assessed and compared with that of conventional biomarkers including stromal tumor-infiltrating lymphocytes (sTILs) and subtype. Results The pCR-score derived from H&E staining achieved an area under the curve (AUC) of 0.847 in predicting pCR directly, and an accuracy, F1 score, and AUC of 0.853, 0.503, and 0.822, respectively, when used in a logistic regression model, higher than those of either sTILs or subtype; a prediction model of pCR constructed by integrating sTILs, subtype and pCR-score yielded a mean AUC of 0.890, outperforming the baseline sTIL-subtype model by 0.051 (0.839, P = 0.001). Conclusion The DL-based pCR-score from histological images predicts pCR better than sTILs and subtype, and holds great potential for more accurate stratification of patients for NAC. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-021-03020-z.

response to NAC requires a pathological examination. Patients who have the pathological complete response (pCR) are expected to have a better outcome than those with the pathological noncomplete response (non-pCR) [4]. Therefore, pCR after NAC has been regarded as a surrogate endpoint of favorable survival for NAC [5]. However, NAC is not effective for all patients, and only a subset of them can achieve pCR [6]. Patients who do not achieve pCR may suffer from toxic effects during NAC, which is likely to worsen their prognosis while accruing high treatment costs. Therefore, predicting pCR before NAC of breast cancers has significant value in sparing the patients from possibly ineffective treatment.
At present, tumor size [7], histological grade [8], Ki67 [9], immunohistochemistry (IHC)-based subtype [8,10,11], and stromal tumor-infiltrating lymphocytes (sTILs) [12][13][14] are the clinicopathological (CP) factors used in predicting pCR, partly due to their wide availability in routine clinical practice. Briefly, small tumor size [7], high grade [8], high Ki67 expression [9], and high sTILs [12][13][14] are positively related to pCR in NAC settings, while hormone receptor-positive (HR +) or human epidermal growth factor receptor 2-negative (HER2 −) subtypes have lower pCR rates than HR − or HER2 + subtypes [10,11]. However, these easily obtained CP factors are not robust enough for pCR prediction in all breast cancer patients. Meanwhile, although recent studies have proposed some molecular signatures to predict pCR to NAC in breast cancer [15][16][17][18], these have been validated in only some of the trials conducted, and they also have the substantial drawbacks of high cost and considerable time investment. Therefore, there is still an urgent need to develop robust and inexpensive biomarkers for the prediction of pCR to NAC in breast cancer.
Apart from the CP and molecular biomarkers for pCR prediction, artificial intelligence technologies applied to image data can develop predictive signatures by extracting hidden information directly from medical images; for example, radiological images from diffuse optical spectroscopy [19], MRI [20], and PET/CT [21] have been used to predict the treatment response to NAC in breast cancer. Compared with radiological images, histological images, which have been regarded as the gold standard for disease diagnosis, can provide more abundant information on tumor characteristics reflecting underlying molecular processes and disease progression. However, the complex and abundant information in histological images is difficult to use adequately because human assessment relies mainly on visible features, whereas deep learning (DL) technology can address this problem by integrating the visible and subvisible information of recurring patterns in complex images [22].
For instance, the cooperation of DL technology and histological images has shown positive results in processing clinical matters, such as tumor detection [23] and prognosis prediction [24,25], even presenting satisfying performance in complex matters such as the prediction of gene mutations [26], classification of multimolecular profiles [27], and microsatellite instability [28] among different cancer types. Additionally, a recent investigation in rectal cancer reported that quantitative features extracted by machine learning from histological images can be predictive for treatment response to neoadjuvant chemoradiotherapy (NCRT) [29]. According to these studies, we argue that analyzing histological images through DL technology could contribute to developing predictive biomarkers of treatment response to NAC in breast cancer.
In this study, we aimed to develop a DL-based biomarker using H&E-stained images, pCR-score, to predict pCR of breast cancer patients receiving NAC, which presents a stronger prediction ability than the conventional pathological factors of subtype and sTILs.

Study population and slides
A total of 540 patients who received NAC between January 2008 and June 2020 at West China Hospital were retrospectively enrolled. The inclusion criteria were as follows: (1) patients diagnosed with primary breast invasive ductal cancer (IDC) without metastasis via a needle biopsy before NAC; (2) patients receiving NAC regimens based on anthracycline, taxane, or anthracycline combined with taxane (≥ 4 cycles) and not undergoing prior therapy (detailed NAC regimens in Additional file 1: Table S1); and (3) patients who underwent surgery after NAC and whose pCR status was confirmed by pathological examination. The exclusion criteria were as follows: (1) patients who received a nonstandard treatment regimen, mainly referring to the treatment of HER2 + breast cancers without trastuzumab; (2) patients lacking complete CP data; (3) patients diagnosed with bilateral, multifocal, or special invasive breast cancer; and (4) patients whose biopsy slides were lost or whose H&E-stained slides were of insufficient quality. The process of patient inclusion is summarized in Additional file 1: Figure S1.
Apart from the H&E-stained slides of the enrolled patients, an additional dataset of 25 H&E-stained IDC slides was designated for developing the automated workflows for tumor epithelium identification.

Pathological evaluation and data collection
Pretreatment breast biopsies were obtained with an ultrasound-guided core needle, routinely fixed in 10% neutral buffered formalin, and stained as H&E slides for diagnosis after paraffin embedding. The surgical specimens after NAC were sampled adequately in the form of tissue slides and examined microscopically by experienced pathologists. pCR was defined as ypT0/isN0 (no residual invasive disease in breast and node) [4]. The estrogen receptor (ER), progesterone receptor (PR), HER2 status, and the Ki67 index were assessed through IHC. ER/PR positivity was defined as positive nuclear staining in no less than 1% of tumor cells [30]. Regarding the Ki67 index, samples were divided into a low-expression set (≤ 20%) and a high-expression set (> 20%) [31]. HER2 status was defined as positive only when IHC was 3 + and/or the gene was amplified on fluorescence in situ hybridization (FISH), while breast cancer with IHC 0/1 + and/or no amplification on FISH was considered HER2-negative disease [32]. sTILs were evaluated on H&E-stained slides according to the international recommended guidelines [33], scored in intervals of 10% from 1 to 90%, and separated into low sTILs (< 10%), moderate sTILs (≥ 10% and < 40%), and high sTILs (≥ 40%). The nuclear grade was assessed based on the Nottingham grading system, and the presence or absence of necrosis was assessed on diagnostic H&E-stained slides. To reduce the subjectivity of pathological evaluation, examinations of sTILs, nuclear grade, and treatment response were performed by two observers separately, and samples that were scored inconsistently by the two observers were reassessed until a consensus was reached.
Apart from the factors above, other clinical data including the age of the patient at diagnosis, tumor/node (T/N) stages, and menstrual status were collected at the same time.
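The categorical cut-offs described in this section can be written down compactly. The following Python sketch is illustrative only and was not part of the study's workflow: the function names are hypothetical, the high sTILs band is assumed to start at 40% (the complement of the moderate band), and HR positivity is assumed to mean ER or PR positivity, which the text does not state explicitly.

```python
def stils_category(stils_percent: float) -> str:
    """Map an sTILs percentage to the low / moderate / high bands."""
    if not 0 <= stils_percent <= 100:
        raise ValueError("sTILs must be a percentage between 0 and 100")
    if stils_percent < 10:
        return "low"
    if stils_percent < 40:
        return "moderate"
    return "high"  # assumption: high means >= 40%


def ki67_category(ki67_percent: float) -> str:
    """Low expression is <= 20%, high expression is > 20%."""
    return "low" if ki67_percent <= 20 else "high"


def ihc_subtype(er_percent: float, pr_percent: float, her2_positive: bool) -> str:
    """Combine hormone receptor (HR) and HER2 status into a four-way label.

    ER/PR are positive when at least 1% of tumor nuclei stain.
    Assumption: HR is positive when ER or PR is positive.
    """
    hr_positive = er_percent >= 1 or pr_percent >= 1
    hr_sign = "+" if hr_positive else "-"
    her2_sign = "+" if her2_positive else "-"
    return f"HR{hr_sign}/HER2{her2_sign}"
```

For instance, a biopsy with 30% sTILs, ER 0%, PR 0% and HER2 amplification would be labelled moderate sTILs and HR-/HER2+ under these assumptions.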

Image processing and model construction
The H&E-stained slides of pretreatment biopsies were scanned at 40 × magnification via a Hamamatsu scanner to prepare whole slide images (WSIs) for experiments. Tumor epithelium (TE) regions of WSIs were identified using a DL-based classification approach (details in Additional file 1: Figure S2). A training dataset of 20 WSIs from the additional dataset was used to develop an automated TE identification model based on a first convolutional neural network (CNN I), and the remaining 5 WSIs were used as the test set. Under manual review, TE was annotated inside two representative tumor regions in the training dataset, and global image annotations were conducted in the test dataset as the gold standard to test the performance of CNN I. NDP Viewer 2 was used for annotation. Tiles identified as tumor epithelium by CNN I were passed to a second convolutional neural network (CNN II), which scored the probability of pCR for each tile.
In the pre-processing step, the tissue recognition tool developed in our previous study [34] was employed to segment the valid tissue areas from the input WSIs, which were then cropped into tiles of 128 × 128 pixels. DL methods were employed to automatically identify tumor epithelium and predict pCR from H&E-stained images. CNN I and CNN II were both DL models; Inception V3 was selected as the base DL architecture for the presented biomarker-generating pipeline because of its trade-off between inference speed and classification accuracy [35]. Cross-entropy loss [36] and stochastic gradient descent (SGD) [37] were used in optimization. However, because the TE tiles selected by CNN I were more homogeneous than the original mixed tiles, scoring the pCR of these selected TE tiles is more difficult than identifying TE regions in the segmented valid tissue areas. Hence, we first leveraged the recently proposed supervised contrastive learning [38] to optimize the feature extraction part of CNN II to produce features that can distinguish between pCR and non-pCR in the selected tiles. Then, based on the learned discriminative features, we optimized the prediction (classification) part of CNN II by using cross-entropy loss [36] and SGD [37]. Additionally, we leveraged a recently proposed fast ensemble deep learning strategy [39][40][41] to further boost the optimized CNN II. In the post-processing step, a pCR-score was calculated by averaging the pCR probabilities of TE tiles for each WSI, which was regarded as a novel biomarker for pCR prediction from histology (Fig. 1). More details about the training and inference procedures of CNN I and CNN II are provided in Additional file 1: Figures S2, S4. The whole pipeline of pCR-score computing was implemented using Python based on TensorFlow/Keras.
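To make the first training stage of CNN II more concrete, the sketch below computes a supervised contrastive loss of the kind introduced in [38] on a handful of tile embeddings. It is a framework-free illustration rather than the TensorFlow/Keras code used in this study; the function names, the temperature of 0.1, and the idea of feeding in small toy embeddings are assumptions made for the example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    For each anchor tile i, tiles with the same label (pCR or non-pCR) are
    positives; all other tiles in the batch form the softmax denominator.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner is skipped
        # Temperature-scaled similarity between the anchor and every other tile.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

The loss is small when tiles sharing a response label lie close together in feature space and large when they are interleaved with tiles of the other label, which is the property that the subsequent cross-entropy stage relies on.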

Statistical analysis
The distribution of clinical characteristics between cohorts was compared using the χ² test or Fisher's exact test. The performance of the CNNs for identifying TE and predicting pCR was assessed by the area under the curve (AUC) for the receiver operating characteristic (ROC) curve. Univariate logistic regression analysis was used to evaluate the odds ratios and probabilities of both conventional predictors and pCR-score in correlation with pCR, after which multivariate logistic regression analysis was performed. The Mann-Whitney U test was used to compare the distribution of pCR-scores across patients with different sTILs densities and subtypes. Based on the logistic regression method, the performance of the biomarker-based models in predicting pCR was assessed using the F1 score, accuracy, and AUC, along with sensitivity (equal to recall)/specificity and positive predictive value (PPV, equal to precision)/negative predictive value (NPV), and comparisons of performance metrics among models were performed with the Wilcoxon signed-rank test or a paired t-test as appropriate. Additionally, the pCR-scores were normalized by z score. All statistical tests were two-sided, and a P-value of less than 0.05 was considered statistically significant. The statistical analyses were performed using SPSS software, version 20.
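For readers less familiar with the performance metrics listed above, the following Python sketch shows how they follow from the confusion counts and from pairwise ranking. It is an illustration with hypothetical function names, not the SPSS procedure used in this study; labels are assumed to be coded 1 for pCR and 0 for non-pCR.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Compute threshold-based metrics from true and predicted 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),  # recall
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),          # precision
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a pCR case outscores a non-pCR case.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Because pCR is the minority class in this cohort, accuracy can remain high while the F1 score and sensitivity are modest, which is why several metrics are reported side by side.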

Study population characteristics
According to the inclusion and exclusion criteria, a total of 540 eligible patients were enrolled in this study. Patients were randomly divided into the primary and validation datasets at a ratio of 8:2. The pCR rates were 18.7% in the primary dataset and 19.6% in the validation dataset, and no significant difference was detected in CP factors between the two datasets (Additional file 1: Table S2). The characteristics of the pCR cohort and non-pCR cohort in the two datasets are summarized in Table 1. It was observed that the statuses of ER (P < 0.001, P < 0.001), PR (P < 0.001, P = 0.005), and HER2 (P < 0.001, P < 0.001) were significantly associated with pCR in the primary and validation datasets. Similarly, sTILs at different levels (low, moderate, and high) showed a different distribution between pCR and non-pCR patients in both the primary and validation datasets (P < 0.001, P < 0.001). However, no significant difference was detected in terms of age, menopausal status, T stage, N stage, or necrosis in pCR and non-pCR cohorts, while the Ki67 index and nuclear grade were significantly correlated with pCR only in the primary dataset but not in the validation dataset.

The pCR-score derived from H&E-stained images is predictive of pCR
In the pre-processing step, the previously developed tool [34] was employed to segment the valid tissue of each WSI, and tiles of 128 × 128 pixels were generated. In total, 44,348 tiles were generated from the training set (20 WSIs) and 20,313 tiles from the test set (5 WSIs), which were used for developing the automated workflows for tumor epithelium identification. CNN I, which classified the tiles as TE or non-TE, achieved an AUC of 0.851 for identifying TE tiles compared with the reference standard (Additional file 1: Figure S3); other relevant performance metrics are shown in Additional file 1: Table S3.
From the output of CNN I, up to 1000 TE tiles with high identified probabilities (> 0.9999) per WSI were selected for pCR scoring, resulting in a total of 292,025 tiles for the whole cohort. TE tiles with definite labels of pCR or non-pCR in the primary dataset were used to train CNN II to calculate the probability of pCR for each tile, and the mean probability of all selected tiles was computed as the pCR-score for one WSI (Fig. 1). Examples of DL-based pCR-score generation for patients with pCR and non-pCR are shown in Fig. 2. Five-fold cross-validation on the primary dataset and a single test on the validation dataset were conducted to assess performance in predicting pCR to NAC directly using the raw pCR-score data. As shown in Fig. 3, the mean AUC in the primary dataset was 0.712 at the WSI level, while it was 0.847 in the validation dataset. AUCs at the tile level are provided in Additional file 1: Figure S5. Moreover, the predictive performance of the pCR-score for different subtypes (HR +/HER2 −, HR +/HER2 +, HR −/HER2 − and HR −/HER2 +) at the tile level and WSI level is shown in Additional file 1: Figure S6. Distributions of the generated pCR-score in the pCR group and the non-pCR group of the validation dataset are provided in Additional file 1: Figure S7.
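The tile selection and score aggregation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, and because the text does not say how the cap of 1000 tiles was enforced, the sketch assumes the tiles with the highest TE probabilities are kept.

```python
def select_te_tiles(te_probabilities, threshold=0.9999, max_tiles=1000):
    """Return indices of tiles whose TE probability exceeds the threshold.

    Assumption: when more than max_tiles qualify, the most confident are kept.
    """
    candidates = [i for i, p in enumerate(te_probabilities) if p > threshold]
    candidates.sort(key=lambda i: te_probabilities[i], reverse=True)
    return candidates[:max_tiles]


def pcr_score(tile_pcr_probabilities):
    """Average the per-tile pCR probabilities of one WSI into its pCR-score."""
    if not tile_pcr_probabilities:
        raise ValueError("a WSI needs at least one selected TE tile")
    return sum(tile_pcr_probabilities) / len(tile_pcr_probabilities)


def wsi_pcr_score(te_probabilities, tile_pcr_probabilities):
    """Chain selection (CNN I output) and averaging (CNN II output) for one WSI."""
    selected = select_te_tiles(te_probabilities)
    return pcr_score([tile_pcr_probabilities[i] for i in selected])
```

Here `te_probabilities` and `tile_pcr_probabilities` stand for the per-tile outputs of CNN I and CNN II, respectively, indexed in the same tile order.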

The pCR-score is an independent biomarker correlated with pCR
To evaluate the clinical significance of the pCR-score, univariate and multivariate logistic regression analyses were performed in the validation dataset, together with the biomarkers in routine clinical use (Table 2). Notably, as CNN II for pCR scoring was built on the primary dataset, the following experiments were conducted only in the validation dataset to avoid an overfitted assessment of the pCR-score. In univariate analysis, the pCR-score was a significant biomarker related to pCR with an odds ratio of 3.516 (95% CI 2.003-6.173, P < 0.001). Apart from the pCR-score, subtype and sTILs were significantly correlated with pCR in the validation dataset, but T stage, Ki67, and nuclear grade were not. In addition, subtype and sTILs were independent markers correlated with pCR in the multivariate analysis without the pCR-score; when the pCR-score was added, it was the only significant predictor, with an odds ratio of 4.045 (95% CI 1.822-8.980, P = 0.001). These results showed that the pCR-score was an independent biomarker correlated with pCR.
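The reported odds ratios can be related back to the underlying logistic regression coefficients. The sketch below assumes that the confidence intervals are Wald intervals, that is, exp(β ± 1.96 × SE), which is not stated in the text; the function names are hypothetical. Since the pCR-scores were z-normalized, the odds ratio would then be read as the change in odds of pCR per one standard deviation increase in the pCR-score.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, standard_error, z=Z_95):
    """Turn a logistic regression coefficient into an odds ratio with a Wald CI."""
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )


def coefficient_from_reported(odds_ratio, ci_low, ci_high, z=Z_95):
    """Recover the coefficient and its standard error from a reported OR and CI."""
    beta = math.log(odds_ratio)
    standard_error = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, standard_error


# Univariate result reported above: OR 3.516 (95% CI 2.003-6.173).
beta, se = coefficient_from_reported(3.516, 2.003, 6.173)
print(f"beta = {beta:.3f}, SE = {se:.3f}")
print("round trip:", [round(v, 3) for v in odds_ratio_ci(beta, se)])
```

On the log scale the reported interval is symmetric around the point estimate, which is consistent with the Wald assumption made here.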

The pCR-score outperforms biomarkers of sTILs and subtype in predicting pCR
For comparison with the pCR-score, we also assessed the predictive ability of the baseline biomarkers using logistic regression models. Sixty percent of the WSIs in the validation dataset were randomly selected as the training set to build prediction models with the biomarkers, and the remaining 40% were taken as the test set to compare performance. This procedure was repeated 16 times to avoid unfair assessments due to data bias (Fig. 4; Table 3; and Additional file 1: Table S4). The prediction model based on the pCR-score performed better in predicting pCR than the sTILs/subtype-based models, with significantly higher accuracy and PPV/precision in particular (Table 3). Other detailed performance metrics are available in Fig. 4 and Table 3. Moreover, the T stage/Ki67/nuclear grade-based models showed poor predictive ability, and adding them did not improve the baseline model (Table 3; Additional file 1: Table S4). Furthermore, an integrated logistic regression model was constructed based on the biomarkers of sTILs, subtype, and pCR-score to assess the relative ability of the pCR-score to predict pCR in comparison with the baseline model. With the addition of the pCR-score in the integrated model, the mean performance metrics improved significantly from 0.840 to 0.884 (P < 0.001), 0.839 to 0.890 (P = 0.001), and 0.565 to 0.682 (P = 0.002) in accuracy, AUC, and F1 score, respectively; sensitivity/recall and PPV/precision were also significantly improved, while specificity and NPV increased without significance (detailed metrics in Fig. 4; Table 3).
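The repeated 60/40 evaluation described above can be outlined as follows. This Python sketch is a hedged illustration, not the procedure as actually run: the function name, the `evaluate` callback, and the fixed random seed are assumptions, and the text does not say whether the random splits were stratified by pCR status, so a plain shuffle is used.

```python
import random
import statistics


def repeated_holdout(n_samples, evaluate, n_repeats=16, train_fraction=0.6, seed=0):
    """Repeat a random train/test split and average a performance metric.

    evaluate(train_indices, test_indices) is a caller-supplied function that
    fits a model on the training indices and returns one metric (for example
    the AUC of a logistic regression model) computed on the test indices.
    """
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # assumption: unstratified random split
        scores.append(evaluate(indices[:n_train], indices[n_train:]))
    return statistics.mean(scores), scores
```

Evaluating two models, such as the baseline and the integrated model, on the same 16 splits yields paired metric values, which is what makes a paired t-test or Wilcoxon signed-rank test applicable to the comparison.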

Distributions of the pCR-score varied among subtypes
To further investigate the relationship of the pCR-score with sTILs and subtype, we visualized how the pCR-score was distributed across patients of different subtypes and sTILs densities (Fig. 5). Patients with the HR −/HER2 + or HR +/HER2 + subtypes appeared to have higher pCR-scores than those with the HR +/HER2 − subtype (P = 0.003, P = 0.019). However, the HR −/HER2 − subtype had an intermediate pCR-score among the four subtypes, showing a slightly higher but non-significant trend in pCR-score compared with the HR +/HER2 − subtype (P = 0.79). Additionally, we visualized the distribution of the pCR-score in different sTILs densities. A trend toward elevated pCR-scores was observed in patients with higher sTIL density, but the difference between the distributions was not significant (Fig. 5).

Discussion
In this study, we proposed the DL-based pCR-score, probably the first biomarker derived from H&E-stained slides without predefined features, which indicates the predictive potential of histological images for treatment response. The pCR-score presented herein is an independent predictor correlating with pCR in multivariate analysis, and it outperforms the conventional biomarkers in predicting pCR. Moreover, further experiments show that the pCR-score reflects additional predictive information solely from H&E-stained slides and is complementary to the existing biomarkers, providing a more robust prediction of pCR to identify the patients who are most likely to benefit from NAC.
Breast cancer is characterized by high heterogeneity of morphology reflecting the underlying molecular process, which can provide indicative information for clinical decision-making. For instance, nuclear pleomorphism is an essential constituent of the breast histological grading system, which implies the aggressiveness of the disease and is related to prognosis. Apart from the manual assessment of morphological features, computer-extracted morphological features such as nuclear shape and texture were capable of predicting patient survival [42]. These findings indicate that the morphological characteristics of breast cancer can provide essential information on the disease. In this study, our results showed that the DL-based raw pCR-scores derived solely from H&E-stained slides achieved an AUC of 0.847 in predicting pCR; this simple predictor is not based on prior knowledge of breast biology or pathology, which implies that histological images contain potential information predicting the treatment response of breast cancer. In one study of 58 breast cancer patients, Dodington et al. [43] focused on the nuclear level, using segmentation to extract a limited set of nuclear features for analysis; the nuclear intensity and gray-level co-occurrence matrix (GLCM-COR) of tumor nuclear features were found to be related to pCR in univariate analysis (P = 0.035, P = 0.039). In contrast, our learning process with CNN II was guided simply by the assessment results of the treatment response instead of focusing on specific features of tumor nuclear morphology, which allowed us to explore a wider range of image information. During the image-processing step, we set up automated workflows to identify the regions of interest (ROIs), including valid tissue detection [34] as the first step followed by TE identification based on CNN I, which addressed the problem of manual annotation for large image datasets by automating the annotation process. Notably, we found that the mean number of TE tiles was smaller in patients with pCR who were not correctly predicted than in patients with pCR who were correctly predicted (603 vs 858), and the prediction accuracy of pCR for patients with ≥ 500 tiles was higher than that for patients with < 500 tiles (50% vs 10%), which implies that a small amount of tumor tissue cannot reflect sufficient morphological information, resulting in an unfair assessment of the pCR-score. One possible explanation is the high heterogeneity of breast cancers, especially in the small quantity of tissue sampled in tumor biopsies, which can also affect the manual evaluation of histopathological features.
Fig. 1 The pipeline of pCR-score computation consists of five sub-steps: a pre-processing; b CNN I; c middle-processing; d CNN II; and e post-processing. First, the pre-processing step segments valid tissue areas from the input WSI and crops the segmented valid tissue areas into small tiles. Second, CNN I takes the cropped tiles as inputs and identifies TE regions by mapping the input tiles into probabilities corresponding to TE. Third, the middle-processing step selects TE tiles with high identified probabilities from the outputs of CNN I. Fourth, CNN II takes the selected TE tiles as inputs and scores the pCR of the input tiles by mapping them into probabilities corresponding to pCR. Finally, the post-processing step fuses the pCR probabilities of TE tiles scored via CNN II to produce the final predicted pCR-score of the input WSI.
At present, information derived from medical images has been accepted as a novel prognostic and predictive biomarker in oncology [22,24,44,45]. For example, Skrede et al. [24] developed a useful DL-based prognostic biomarker from histological images and showed that it outperformed established molecular and morphological prognostic markers; Kather et al. [44] proposed the deep stroma score, a DL-based biomarker derived solely from H&E-stained images, which was demonstrated to be an independent prognostic factor in colorectal cancer. In the present study, we proposed the pCR-score, which is independently correlated with pCR and is a promising biomarker that can classify breast cancer patients as "potential responders" or "potential non-responders" solely based on H&E-stained images. Unlike conventional histopathological biomarkers, which require manual assessment, the pCR-score is generated from scanned images by the DL systems without the extra effort of manual evaluation, which avoids intra-observer subjectivity and inter-observer variation. The DL-based pCR-score can provide complementary information that is not being extracted from routine material in current clinical workflows, and its combination with conventional biomarkers has the potential to better stratify patients and to decide which individuals are more likely to benefit from NAC. We also found that the distributions of pCR-scores varied across different subtypes of breast cancer, in line with the response rates of the subtypes to NAC [10,11]. For example, higher pCR-scores are more likely to appear in patients with HER2 + subtypes or HR − subtypes (the lack of a significant difference between the HR −/HER2 − and HR +/HER2 − subtypes might be due to the small number of HR −/HER2 − samples), which suggests that the pCR-score derived from H&E-stained images might reflect the histological differences associated with subtypes to predict treatment response to NAC.
Indeed, published data support the idea that morphological features of breast cancer can provide the subtype information [46]. However, the differences in the distribution of pCR-scores among sTILs densities were not significant, since the pCR-score is derived from the TE while the assessment of sTILs is focused on stromal regions.
Although our study demonstrated the potential of tumor histology to predict pCR via DL approaches and proposed a novel biomarker that is a more effective predictor than sTILs or subtype, it still has some limitations. First, only 540 patients were included retrospectively for training and validation; hence, future studies should pursue prospective multicenter investigations. Second, only some of the tiles from each WSI were used for training and prediction; subsequent research should aim to develop more advanced methods to incorporate more tiles to better account for the heterogeneity of breast cancer. Third, although the pCR-score does not rely on manual assessment as sTILs and subtype do, generating it requires an expert data analyst. Additionally, although the integrated model outperformed the others, which supports our conclusion, its F1 score and sensitivity were not high in absolute terms; further improvement in a larger cohort is required. Finally, considering that this study demonstrated the value of the TE in predicting the treatment response to NAC, we will continue to explore the predictive potential of the nontumor compartment of breast cancer via DL approaches.

Conclusion
In conclusion, we proposed the pCR-score, a promising DL-based histological biomarker, and demonstrated its excellent performance in predicting pCR to NAC, exceeding that of the baseline biomarkers sTILs and subtype. The pCR-score facilitates better stratification of breast cancer patients for NAC; as more DL-based biomarkers are developed, a more robust predictive model may be created to assist clinical treatment planning.