Real-time detection of laryngopharyngeal cancer using an artificial intelligence-assisted system with multimodal data

Background Laryngopharyngeal cancer (LPC) includes laryngeal and hypopharyngeal cancer, whose early diagnosis can significantly improve the prognosis and quality of life of patients. Pathological biopsy of suspicious cancerous tissue under the guidance of laryngoscopy is the gold standard for diagnosing LPC. However, this subjective examination largely depends on the skills and experience of laryngologists, which increases the possibility of missed diagnoses and repeated unnecessary biopsies. We aimed to develop and validate a deep convolutional neural network-based Laryngopharyngeal Artificial Intelligence Diagnostic System (LPAIDS) for automatically identifying LPC in real time in both laryngoscopy white-light imaging (WLI) and narrow-band imaging (NBI) images, improving the diagnostic accuracy of LPC by reducing diagnostic variation among non-expert laryngologists.

Methods In total, 31,543 laryngoscopic images from 2382 patients were categorised into training, verification, and test sets to develop, validate, and internally test LPAIDS. Another 25,063 images from five other hospitals were used as external tests. Overall, 551 videos were used to evaluate the real-time performance of the system, and 200 randomly selected videos were used to compare the diagnostic performance of LPAIDS with that of laryngologists. Two deep-learning models using either WLI (model W) or NBI (model N) images were constructed for comparison with LPAIDS.

Results LPAIDS had a higher diagnostic performance than models W and N, with accuracies of 0·956 and 0·949 in the internal image and video tests, respectively. The robustness and stability of LPAIDS were validated in external sets, with area under the receiver operating characteristic curve values of 0·965–0·987. In the laryngologist-machine competition, LPAIDS achieved an accuracy of 0·940, which was comparable to that of expert laryngologists and outperformed laryngologists with other qualifications.
Conclusions LPAIDS provided high accuracy and stability in detecting LPC in real time, showing great potential for improving the diagnostic accuracy of LPC by reducing diagnostic variation among non-expert laryngologists.

Supplementary Information The online version contains supplementary material available at 10.1186/s12967-023-04572-y.


Background
Laryngopharyngeal cancer (LPC), including laryngeal cancer (LCA) and hypopharyngeal cancer, is the second most common malignancy among head and neck tumours, with more than 130,000 deaths reported in 2020 [1]. Laryngoscopy biopsy is the gold standard for diagnosing LPC [2,3]. In-office transnasal flexible electronic endoscopy can intuitively examine the laryngopharynx, making it the most effective device for detecting LPC [4,5]. However, the limited resolution and contrast of white light can lead to the neglect or missed diagnosis of superficial mucosal cancers, even by experienced endoscopists [6,7]. This can lead to patients being diagnosed at a later stage and thus having to undergo a multimodal treatment approach, resulting in poor prognosis and reduced quality of life [8–10]. Furthermore, a precautionary biopsy is usually prescribed to avoid the missed diagnosis of early-stage cancer, resulting in overtreatment and emotional stress to patients [11]. Recently, endoscopic systems with narrow-band imaging (NBI), which can improve the clarity and identification of epithelial and subepithelial microvessels, have played a critical role in the early diagnosis of LPC with high specificity and sensitivity [12–14]. However, because proficient use of this technology requires relatively long professional training and accumulated clinical experience, suspicious LPCs are at high risk of being missed during endoscopy in hospitals with inexperienced laryngologists, in underdeveloped regions, and in countries with large numbers of patients [15,16].
Recently, artificial intelligence (AI) has shown great potential in assisting doctors with diagnoses in various medical fields [17–19]. In particular, deep learning techniques based on deep convolutional neural networks (DCNNs) have demonstrated extraordinary capabilities in medical image classification, detection, and segmentation [20,21]. Benefiting from its super-resolution performance on microscopic images, AI can automatically infer complex microscopic imaging structures (i.e., abnormalities in the extent and colour intensity of mucosal tubular branches) and identify quantitative pixel-level features [22], which are usually indistinguishable to the human eye. Several studies have demonstrated the feasibility and effectiveness of deep learning for lesion detection and the pathological classification of endoscopic images. Unfortunately, there are still several limitations to the existing research, particularly concerning laryngoscopy. Despite the real-time nature of endoscopy, current research is limited to the detection of single images [23,24], and there is a lack of studies integrating AI into dynamic videos.
Additionally, most existing studies focus on a single light source, using either white-light imaging (WLI) or NBI images [25–27], without considering the fusion of their multimodal features, which may increase the possibility of missed diagnosis and misdiagnosis.
We developed a DCNN-based Laryngopharyngeal Artificial Intelligence Diagnostic System (LPAIDS) that incorporates NBI and WLI multimodal features for the endoscopic diagnosis of laryngopharyngeal carcinoma. We aimed to investigate whether the model can achieve expert-comparable performance and be applied in real-world laryngoscopy scenarios. Therefore, to fully simulate the clinical scene of endoscopy in the real world, we extracted video frames from laryngoscopy videos recorded during real-world endoscopy for model training. The diagnostic performance was validated using a time-series test set and external test sets from five other hospitals, and its real-time detection performance was verified using videos. Additionally, we compared the performance of LPAIDS with that of laryngologists of different qualifications in a laryngologist-machine competition.

Study design and participants
This retrospective, multicentre diagnostic study was conducted in six tertiary hospitals in China. We retrospectively obtained electronic laryngoscopy videos from the First Affiliated Hospital of Sun Yat-sen University (FAHSYSU). We extracted the required video frames, including NBI and WLI images, for the development, validation, and internal testing of LPAIDS. Time-series sets were used to train, validate, and test the model to better evaluate its practicability in clinical practice.
To assess the generalisability of LPAIDS, laryngoscopic images of patients were collected from the following five hospitals in China for external testing: Sun Yat-sen Memorial Hospital of Sun Yat-sen University (SYMSYSU), Nanfang Hospital of Southern Medical University (NHSMU), First Affiliated Hospital of Shenzhen University (FAHSU), Third Affiliated Hospital of Sun Yat-sen University (TAHSYSU), and Sixth Affiliated Hospital of Sun Yat-sen University (SAHSYSU). To evaluate the real-time efficacy of LPAIDS, videos stored at FAHSYSU from 1 December 2021 to 31 March 2022 were collected for performance testing, and 200 videos were randomly selected for performance comparison with endoscopists of different levels.
Enrolled laryngoscopic images or videos were obtained from consecutive patients aged ≥ 18 years who underwent laryngoscopy. According to the World Health Organization classification of tumours, the pathological diagnosis was confirmed by two board-certified pathologists using haematoxylin-eosin-stained tissue slides, which served as the gold standard for judgement. The exclusion criteria were patients who had previously undergone laryngeal surgery or chemotherapy and radiotherapy for LPC and those without a histologically confirmed pathological diagnosis. Patients with laryngopharyngeal lesions (including carcinomas of the larynx and hypopharynx) with histologically proven malignancies were eligible for this study. For normal controls or participants with histologically confirmed benign neoplasms (such as vocal cord polyps, vocal nodules, and vocal cord leucoplakia), no specific exclusion criteria were applied regarding clinical characteristics or demographics.

Laryngoscopy and image quality control
All laryngoscopies in this study were performed in daily clinical practice as screening or pretreatment examinations. The equipment used included different models of standard laryngoscopes (ENF-VT2, ENF-VH, ENF-VT3, ENF-V2, or ENF-V3; Olympus Medical Systems, Tokyo, Japan; EV-N, EV-NC20, or EV-NE; Xion, Berlin, Germany) and video systems (VISERA ELITE OTV-S190, EVIS EXERA III CV-190, EVIS LUCERA CV-260SL, and VISERA Pro OTV-S7Pro; Olympus Medical Systems, Tokyo, Japan; XN HD3, Xion, Berlin, Germany). All laryngoscopy videos were stored in AVI or MP4 format, and images were stored in JPG format at the six hospitals.
Laryngoscopy video frames were extracted by three doctoral students. The extracted video frames contained different representative positions and angles of the laryngopharynx and covered its various activities. No more than 10 video frames were captured per patient, and repeated sampling at the same location was avoided. Nasopharyngeal and oropharyngeal images, as well as images of lesions that were difficult to assess because of poor visual field quality due to active bleeding, thick buffy coat, mucus, halos, defocus, blurring, or reflections, were removed. Three highly experienced laryngoscopists at FAHSYSU, each with at least 5 years of experience in laryngoscopy and more than 3000 laryngoscopy examinations performed, carefully reviewed all images and selected representative LPC and non-cancer images according to the pathology reports. The three laryngoscopists independently delineated all cancer lesions to outline the boundaries of the actual lesion area within the images. Images were annotated using the labelme tool (https://github.com/wkentaro/labelme), and the annotated images were used as mask layers for model training. All images were reviewed using cross-checking and expert reviews for quality control to avoid individual bias. Annotations and delineations were finalised only when a consensus was reached between at least two endoscopists. When two endoscopists could not agree, a senior laryngeal specialist with at least 20 years of experience in laryngopharyngeal tumours made the final decision.

Dataset distribution
The dataset distribution of this study is shown in Fig. 1. Laryngoscopy videos of 2775 patients were retrospectively obtained from the database of the Laryngoscopy Center of FAHSYSU, and 393 patients were excluded based on the exclusion criteria. Overall, 49,176 laryngoscopy video frames were extracted from the remaining 2382 patients. After quality assessment, 17,633 frames were discarded because of poor quality or unavailable pathology reports. For patients with cancer, only images of cancerous lesions were included; for patients without cancer, images of normal controls and benign lesions were included. The remaining 31,543 images were used for model training, temporal verification, and temporal testing, and 1005 videos were used for temporal verification and temporal testing of the model. A dataset of 25,293 images from 6806 patients at the five other centres served as external test sets. Patients were independent across the different datasets. Additionally, a human-machine competition set of 200 videos randomly selected from the temporal internal video test sets was used to compare the performance of LPAIDS with that of laryngologists of different qualifications. All videos and images were anonymised before recording to protect patients' privacy.

Development of models
Since the diagnosis was a classification task, we conducted the diagnosis based on the output of semantic segmentation models. As shown in Fig. 2, first, the semantic segmentation models were used to predict tumour regions on each video frame. Second, we decided whether a video frame was classified as cancer according to the size and shape of the predicted regions. Finally, we made the diagnosis based on the continuous LPC regions in the video frame sequence. The model's algorithm was based on the concept of U-Net [28], which consists of an encoder and a decoder to extract and combine different levels of features. The encoder included four convolutional blocks with two 3 × 3 layers, each followed by a rectified linear unit (ReLU) and a 2 × 2 max pooling operation with a stride of 2 for downsampling. The decoder comprised four upsampling blocks with a concatenation of the current feature map and the feature map correspondingly cropped from the encoder, each followed by two 3 × 3 convolutional layers with ReLU activations.
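The three-step pipeline above (per-frame segmentation, frame-level classification by region size, then video-level diagnosis from consecutive positive frames) can be sketched in plain Python. This is an illustrative simplification, not the authors' implementation: `largest_region_area` uses a simple 4-connected flood fill, and the default thresholds mirror the values reported later (a 32 × 32 pixel area and 2 s of consecutive positive frames).

```python
from collections import deque

def largest_region_area(mask):
    """Area (in pixels) of the largest 4-connected foreground region in a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                area, queue = 0, deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    area += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                best = max(best, area)
    return best

def frame_is_cancer(mask, min_area=32 * 32):
    """Step 2: a frame is positive if its largest predicted region reaches the area threshold."""
    return largest_region_area(mask) >= min_area

def video_is_cancer(frame_labels, fps, min_seconds=2.0):
    """Step 3: a video is positive if positive frames persist for at least min_seconds."""
    run = best = 0
    for positive in frame_labels:
        run = run + 1 if positive else 0
        best = max(best, run)
    return best >= min_seconds * fps
```

A real system would run this on every mask emitted by the segmentation network, but the decision logic itself is this simple.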

Testing of the models in still images
First, we tested the performance of LPAIDS in identifying LPC using the independent temporal image test sets from FAHSYSU. Furthermore, we used WLI and NBI images in the internal temporal image test sets: we compared the diagnostic performance of LPAIDS and model W on WLI images and that of LPAIDS and model N on NBI images. Subsequently, we assessed the robustness of LPAIDS using five external test sets from SYMSYSU, NHSMU, FAHSU, TAHSYSU, and SAHSYSU, each with a small number of patients with LPC.

Testing of the models in the temporal video datasets and comparison with laryngologists
We used clipped videos as the test sets to assess the applicability of LPAIDS in the clinic. Under the guidance of a laryngeal expert, three doctoral students de-identified and clipped the videos. The length of the video clips was 8–25 s per lesion. Similarly, we used WLI and NBI videos in the temporal internal video test sets and compared the diagnostic performance of LPAIDS and model W on WLI videos and that of LPAIDS and model N on NBI videos.
For further performance evaluation of LPAIDS, we randomly selected 200 videos (115 WLI and 85 NBI videos) from the temporal video test sets. Subsequently, we mixed them in a scrambled order and de-identified them. Ten laryngologists with varying degrees of expertise (expert, senior, resident, and trainee) were asked to complete the 200 test videos independently, and their results were compared with those of LPAIDS. The 10 laryngologists were not involved in selecting or annotating any of the datasets and were blinded to the demographics and final histopathological results of patients in the test sets. The expert laryngologist was a professor with > 20 years of experience in endoscopic procedures. The three senior laryngologists were attending doctors with more than 5 years of experience who had completed clinical and specific endoscopic training. The three laryngologist residents had more than 3 years of endoscopic experience. The three trainees were interns with 1 year of endoscopic experience.

Outcomes
The primary outcomes were the diagnostic accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the models for identifying cancerous lesions. Accuracy was defined as the percentage of correctly classified individuals among all participants. Sensitivity and specificity were determined as the percentage of correctly identified pathologically confirmed cancerous cases and negative controls, respectively. PPV was the proportion of correctly predicted positive samples among all predicted positives, and NPV was the proportion of correctly predicted negative samples among all predicted negatives. To visually interpret the learned model, we overlaid a heat map on the input image to examine whether the salient regions in the saliency map corresponded to the region of interest for decision-making. The segmentation prediction for an image comprised a predicted value at each pixel indicating whether it was cancer or background: high values corresponded to cancer and low values to background. We assigned colours to pixels based on these predicted values to obtain the heatmap. Additionally, we used intersection-over-union (IOU), computed from the annotations and predictions of the LPC regions, to measure the image segmentation performance of the model.
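The pixel-level thresholding and the IOU metric described above can be expressed compactly in plain Python; the 0.5 cut-off below is an illustrative choice, not a value reported by the study.

```python
def pixel_labels(scores, threshold=0.5):
    """Binarise per-pixel prediction scores: high values map to cancer (1), low to background (0)."""
    return [[int(s > threshold) for s in row] for row in scores]

def iou(pred, truth):
    """Intersection-over-union between two equally sized binary masks."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0
```

For example, `iou(pixel_labels(scores), annotation)` scores one predicted mask against its expert-drawn annotation; the paper's median IOU of 0·698 is the median of such per-image scores.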

Statistical analysis
The thresholds for the final decision were based on statistical data. First, we applied dilation and erosion operations to eliminate voids and noise and to filter the shape of the tumour regions, which were always connected. For still images, we chose the classification threshold from 8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128, and 256 × 256 pixels, where 8 × 8 and 256 × 256 led to the minimum and maximum values, respectively. The 32 × 32 threshold had the best performance; therefore, images with predicted regions exceeding 32 × 32 pixels in area were classified as cancer. For videos, we selected the diagnosis threshold from different durations (0.5 s, 1 s, 1.5 s, 2 s, 2.5 s, and 3 s), where 0.5 s and 3 s led to the minimum and maximum values, respectively. The 2 s threshold had the best performance; therefore, videos with consecutive cancerous frames spanning more than 2 s were diagnosed as cancer.
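The dilation and erosion step can be illustrated with a minimal pure-Python sketch using a 3 × 3 structuring element (an assumption; the study does not report the kernel size). Dilation followed by erosion (closing) fills small voids inside a predicted region, while erosion followed by dilation (opening) removes isolated noise pixels.

```python
def dilate(mask):
    """3 x 3 binary dilation: a pixel becomes foreground if any neighbour is foreground."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def erode(mask):
    """3 x 3 binary erosion: a pixel stays foreground only if all neighbours are foreground."""
    h, w = len(mask), len(mask[0])
    return [[int(all(mask[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def clean_mask(mask):
    """Close small voids (dilate then erode), then remove speckle noise (erode then dilate)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```

After cleaning, the area and duration thresholds above are applied to the resulting connected regions.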
To assess the performance of LPAIDS and the laryngologists in identifying cancerous lesions, metrics including accuracy, sensitivity, specificity, PPV, and NPV were evaluated, with 95% confidence intervals (CIs) calculated using the Clopper-Pearson method. Performance comparisons between LPAIDS and the laryngologists were made using the two-sided McNemar test. The receiver operating characteristic (ROC) curve, created from the true positive rate (sensitivity) and false positive rate (1 - specificity), was employed to show the diagnostic ability of the models in discriminating patients with LPC from controls, and the area under the ROC curve (AUC) was calculated; larger AUC values indicated better diagnostic performance. Inter-observer and intra-observer agreements of LPAIDS and the laryngologists were computed using Cohen's kappa coefficient. Statistical significance was set at p < 0·05. Statistical analyses were performed using SPSS (version 22.0; IBM, USA) or Python (version 3.7.13).
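For reference, the Clopper-Pearson interval can be computed exactly without a statistics package by inverting the binomial tail with bisection. This is a self-contained sketch equivalent to the usual Beta-quantile formulation, not the code used in the study (which relied on SPSS or Python libraries).

```python
from math import comb

def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve(f, target, lo=0.0, hi=1.0, iters=100):
    """Bisection for f(p) = target; f must be monotone on [lo, hi]."""
    increasing = f(hi) > f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a proportion of k successes in n trials."""
    lower = 0.0 if k == 0 else _solve(lambda p: 1 - _binom_cdf(k - 1, n, p), alpha / 2)
    upper = 1.0 if k == n else _solve(lambda p: _binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

For instance, `clopper_pearson(5, 10)` returns roughly (0.187, 0.813), the standard exact 95% CI for 5 successes in 10 trials.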

Baseline characteristics
Overall, 2382 individuals from FAHSYSU and 6806 individuals from the five other hospitals were enrolled in this study. The baseline patient characteristics are shown in Table 1. We further evaluated the segmentation performance of LPAIDS in the positive pathological tissues. The LPAIDS-predicted segmented regions of LPC lesions were highly consistent with the areas labelled by laryngologists, with a median IOU of 0·698 in the internal temporal test sets (Fig. 5).

Discussion
In this study, we developed a DCNN-based intelligent diagnosis system for LPC, called LPAIDS, which incorporated both WLI and NBI images to automatically identify patients with LPC and was trained and validated across six hospitals. The system showed promising diagnostic performance in six independent test sets, with satisfactory accuracy (0·949–0·984), sensitivity (0·901–0·986), specificity (0·946–0·987), and AUC values (0·965–0·987). In a human-machine competition using an independent video test set, the diagnostic performance of LPAIDS was comparable to that of expert laryngologists and outperformed that of laryngologists with other qualifications. To the best of our knowledge, this is the most extensive study of AI-guided detection of LPC lesions based on laryngopharyngeal endoscopic images.
The screening and diagnosis of laryngopharyngeal carcinoma primarily rely on laryngoscopy and pathological biopsy of suspicious cancer tissue under the guidance of laryngoscopy [29]; this subjective examination largely depends on the skills and experience of laryngologists, which increases the possibility of missed diagnoses and repeated unnecessary biopsies. Early LPC manifests as subtle mucosal changes under WLI. Combined with NBI, the visualisation of submucosal microvascular morphology is enhanced, and thick black spots can be observed within and surrounding malignant lesions [30,31], which improves the detection rate of LPC. However, this technology suffers from a relatively long learning curve and is hampered by the need for expertise and intensive training in optical image interpretation [32]. In contrast, our system can recognise WLI and NBI images simultaneously with nearly no training or experience requirements for laryngologists, achieving a diagnostic accuracy similar to that of experts and better than that of non-experts in identifying LPC. This shows extraordinary potential for diagnosing LPC, particularly in developing countries or areas with an unbalanced distribution of medical resources. LPAIDS can help bridge the diagnostic gap between national and primary care hospitals and improve the diagnostic level of laryngologists who lack extensive experience and training.
Recently, in the field of endoscopy, the computer-aided diagnosis of gastrointestinal tumours has made remarkable progress [33–36]. Several preliminary studies have verified the feasibility of this method in the auxiliary diagnosis of LCA. Ren et al. established a CNN-based classifier to classify laryngeal diseases [23]. Furthermore, Cho et al. applied a deep learning model to discriminate various laryngeal diseases other than malignancy [37]. Both reported high accuracy rates. However, in these two retrospective single-institution studies, the validation set was a small random subset of all collected images. This suggests that several images of one patient may have been distributed across both the training and validation sets, leading to an overestimation of the test results. The training and testing of our model adopted time-series sets: all training, validation, and testing images were collected at different periods and were completely independent, which simulates the datasets of prospective clinical trials and yields more objective and convincing results. Xiong et al. developed a DCNN-based model using WLI images to diagnose LCA with an accuracy of 0·867 [25]. Additionally, He et al. developed a CNN model using NBI scans to identify patients with LCA, with an AUC of 0·873 in an independent test set [38]. Their studies were based on the diagnosis of a single imaging mode, which may lead to the omission of focal features of the lesion, weakening the performance of AI-assisted diagnosis. Furthermore, both studies were applicable only to the detection of still images, which limits their practicality in clinical applications. The clinical application of AI requires the ability to analyse and diagnose complex situations in real time, and video contains multiple angles of the lesion and more complex diagnostic settings closer to the actual clinical environment. A pilot study by Azam et al. used 624 video frames of LCA to develop a YOLO ensemble model for the automatic detection of LCA in real time [24]. That study focused on the automatic segmentation of tumour lesions using only LCA video frames, achieving an accuracy of 0·66 on 57 testing images, and verified the real-time processing performance of the model on six video laryngoscopes. Owing to the small sample size and lack of controls, these results and their feasibility for the auxiliary diagnosis of LCA in clinical application should be treated cautiously. Our system requires only 26 ms to analyse one video frame, so an average of 38 video frames can be identified per second, meeting the performance requirements for real-time detection. Furthermore, our approach achieved a diagnostic accuracy of 0·949 on an independent video test set of 551 videos, demonstrating real-time dynamic recognition ability. Therefore, our system is more reliable for diagnosing LPC in real time and has higher clinical utility than previously reported models.
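The real-time claim above follows from simple arithmetic: at 26 ms per frame, roughly 1000 / 26 ≈ 38 frames can be analysed per second, which exceeds the 25–30 frames per second of common video standards (the stream rates in this check are general video-format figures, not numbers from the study).

```python
def meets_realtime(ms_per_frame, stream_fps):
    """True when per-frame inference keeps up with a live stream of stream_fps frames/s."""
    return 1000.0 / ms_per_frame >= stream_fps

# 26 ms per frame -> about 38 frames analysed per second, as reported in the text.
throughput = int(1000 / 26)
```

Any inference time above 1000 / stream_fps milliseconds would force the system to drop frames instead of analysing every one.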
Our system achieved satisfactory diagnostic performance, with high accuracy on both the image test sets (0·956 [95% CI 0·951–0·960]) and the video test sets (0·949 [95% CI 0·931–0·968]), which benefited from our improvements to the U-Net. We extracted two features from the WLI and NBI images, respectively, which independently represented the two data types, and then fused them. Compared with models simply trained on mixed images, LPAIDS produced more accurate predictions on both WLI and NBI images. Furthermore, the integration of the two features is based on linear layers, which takes less time than feature extraction from multimodal data; this fast integration ensures that LPAIDS can meet demanding real-time requirements. The stability and robustness of the model were validated using the five independent external validation sets. Moreover, the diagnostic performance of our system was comparable to that of experts and higher than that of non-experts. We used Cohen's kappa coefficient to assess the consistency of the system and the laryngologists and found that the expert achieved significant intra-observer consistency (k = 0·948), which was higher than that of the senior laryngologists (k: 0·755–0·811), laryngologist residents (k: 0·667–0·711), and trainees (k: 0·514–0·610).
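The linear fusion step can be illustrated with a toy example: the two modality-specific feature vectors are concatenated and passed through a single linear layer, which costs only one matrix-vector product. The dimensions, weights, and function name below are invented for illustration and are not the study's actual architecture.

```python
def linear_fusion(feat_wli, feat_nbi, weights, bias):
    """Concatenate WLI and NBI feature vectors and apply one linear layer (y = W x + b)."""
    fused = list(feat_wli) + list(feat_nbi)
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]
```

Because the fusion is a single linear map over already extracted features, its cost is negligible next to the convolutional encoders, which is the speed argument made above.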
Despite these promising results, some limitations remain. First, this was a retrospective study, which may carry a certain degree of selection bias, and the excellent performance of LPAIDS may not fully reflect actual clinical application. Time-series sets were used in the study design to mitigate this problem, and we have designed and prepared a multicentre prospective randomised controlled trial to verify the applicability of the system in a clinical setting. Second, our dataset mostly comprises high-quality laryngoscopy images, which may limit the scope of use of the system. However, our test sets used images acquired by different endoscopy systems from various manufacturers, such as Olympus and Xion, which account for most of the endoscopy market, and we will collect more images of varying quality to enhance the generalisation ability of the system. Third, although we used a video test to demonstrate the real-time detection performance of the system, the clipped videos contained only lesions, and the real-time applicability in actual clinical practice remains to be evaluated. We will further work on embedding the system into the endoscopic workflow to output prediction results during laryngoscopy and to evaluate the model's reliability.

Conclusion
We developed a DCNN-based system for the real-time detection of LPC. The system can recognise WLI and NBI imaging modalities simultaneously, achieving high accuracy and sensitivity in independent image and video test sets. Its diagnostic efficiency was equivalent to that of experts and better than that of non-experts. However, multicentre prospective verification is still needed to provide high-level evidence for detecting LPC in actual clinical practice. We believe that LPAIDS has excellent potential for aiding the diagnosis of LPC and reducing the burden on laryngologists.


Abbreviations
FAHSU: First Affiliated Hospital of Shenzhen University; FAHSYSU: First Affiliated Hospital of Sun Yat-sen University; NHSMU: Nanfang Hospital of Southern Medical University; TAHSYSU: Third Affiliated Hospital of Sun Yat-sen University; SAHSYSU: Sixth Affiliated Hospital of Sun Yat-sen University; SYMSYSU: Sun Yat-sen Memorial Hospital of Sun Yat-sen University

Fig. 6
Fig. 6 ROC curves illustrating the performance of LPAIDS for identifying laryngopharyngeal cancer in multicentre imaging datasets. FAHSU: First Affiliated Hospital of Shenzhen University; FAHSYSU: First Affiliated Hospital of Sun Yat-sen University; LPAIDS: Laryngopharyngeal Artificial Intelligence Diagnostic System; NHSMU: Nanfang Hospital of Southern Medical University; ROC: receiver operating characteristic; SAHSYSU: Sixth Affiliated Hospital of Sun Yat-sen University; SYMSYSU: Sun Yat-sen Memorial Hospital of Sun Yat-sen University; TAHSYSU: Third Affiliated Hospital of Sun Yat-sen University

Fig. 7
Fig. 7 Diagnostic performance for identifying laryngopharyngeal cancer between LPAIDS and laryngologists in 200 videos. a Receiver operating characteristic curves of LPAIDS, the expert, seniors, laryngologist residents, and trainees for comparison of diagnostic performance. b Confusion matrices obtained by LPAIDS and ten laryngologists with varying degrees of expertise. Expert: a professor with > 20 years of experience in endoscopic procedures. Senior: attending doctors with more than 5 years of experience who had completed clinical and specific endoscopic training. Residents: residents with more than 3 years of endoscopic experience. Trainee: interns with 1 year of endoscopic experience. LPAIDS: Laryngopharyngeal Artificial Intelligence Diagnostic System

Table 2
Performance comparison among LPAIDS, model W, and model N

Table 3
Performance of LPAIDS in different validation datasets

Table 4
Comparison between LPAIDS and laryngologists in 200 videos