Research ArticleCancer

Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases

See allHide authors and affiliations

Science Translational Medicine  11 Sep 2019:
Vol. 11, Issue 509, eaaw8513
DOI: 10.1126/scitranslmed.aaw8513

Discriminating lung primary tumors and metastases

Pulmonary metastases of head and neck squamous cell carcinoma (HNSC) are currently difficult to distinguish from primary lung squamous cell carcinomas (LUSCs). Differentiating these tumor types has important clinical implications, as whether the lung tumor is primary or has spread can affect the treatment options offered to a patient. Here, Jurmeister et al. developed a machine learning algorithm that exploits the differential DNA methylation observed in primary LUSC and metastasized HNSC tumors in the lung. Their method was able to discriminate between these two tumor types with high accuracy across multiple cohorts, suggesting its potential as a clinical diagnostic tool.


Head and neck squamous cell carcinoma (HNSC) patients are at risk of suffering from both pulmonary metastases or a second squamous cell carcinoma of the lung (LUSC). Differentiating pulmonary metastases from primary lung cancers is of high clinical importance, but not possible in most cases with current diagnostics. To address this, we performed DNA methylation profiling of primary tumors and trained three different machine learning methods to distinguish metastatic HNSC from primary LUSC. We developed an artificial neural network that correctly classified 96.4% of the cases in a validation cohort of 279 patients with HNSC and LUSC as well as normal lung controls, outperforming support vector machines (95.7%) and random forests (87.8%). Prediction accuracies of more than 99% were achieved for 92.1% (neural network), 90% (support vector machine), and 43% (random forest) of these cases by applying thresholds to the resulting probability scores and excluding samples with low confidence. As independent clinical validation of the approach, we analyzed a series of 51 patients with a history of HNSC and a second lung tumor, demonstrating the correct classifications based on clinicopathological properties. In summary, our approach may facilitate the reliable diagnostic differentiation of pulmonary metastases of HNSC from primary LUSC to guide therapeutic decisions.


The prognosis of patients with head and neck squamous cell carcinoma (HNSC) is mostly limited by the presence of distant metastases that mainly occur in the lung (1, 2). However, these patients are also at high risk of developing a second squamous cell carcinoma of the lung (LUSC), as both cancers share similar epidemiology and risk factors (35). Further compounding the problem, pulmonary metastases of HNSC and primary LUSC regularly share the same histomorphology and immunohistochemical profiles (6). Therefore, when a patient with a history of HNSC is diagnosed with a synchronous or metachronous squamous lung tumor, it is in many cases impossible to distinguish metastatic HNSC from a second independent tumor originating in the lung. It is thus difficult to accurately assess the proportion of HNSC metastases within these patients, which is reflected by the broad range reported in the literature from 17 to 63% of HNSC patients where an additional squamous tumor of the lung was determined to be a metastasis (79). However, the differentiation of these two diseases is of central importance to determining optimal treatment strategies and to correctly assessing the prognosis of the individual patient. Whereas distant metastases of HNSC are mostly incurable and patients mainly receive only palliative chemotherapy or radiotherapy, patients with locally limited LUSC normally qualify for potentially curative therapy including lung lobectomy (1012).

With recent technical advances in molecular pathology, methods using next-generation sequencing (NGS), RNA sequencing, and proteomics have been proposed to address this diagnostic dilemma. Because mutational profiles show substantial overlap across different cancer types (13, 14), it is likely that only a direct comparison of the mutational profile using NGS of both tumors may help distinguish HNSC metastases from second lung cancers in the event vast overlaps or differences between the two mutational profiles are found (15). However, in clinical routine, this comparative approach might fail if the primary tumor is not suitable for molecular analysis (for example, if the tumor tissue sample is used up or unavailable due to external pathology, degradation due to age, or decalcification) or if the NGS panel does not cover mutations that would allow differentiation between a common versus distinct tumor origin. More recently, RNA sequencing– or proteomics-based signatures that are specific for the tissue of origin have been suggested to avoid the need to analyze and directly compare the original HNSC and lung tumors (7, 16, 17). However, with accuracy rates below 90%, these methods are of limited utility for implementation in routine diagnostics.

The DNA methylation signatures of different tissue types are known to be quite specific, which has resulted in the recent development of promising algorithms that can characterize cancers of unknown primary tumors, brain tumors, and sarcomas according to their epigenetic signatures (1820). On the basis of these results, we aimed to develop a DNA methylation–based machine learning classifier that facilitates the diagnostic differentiation of HNSC metastases to the lung from primary LUSC.


Clinical cohort

In a retrospective analysis of cases from our institution, we identified 408 patients with a history of primary HNSC and a synchronous or metachronous squamous lung tumor. The results from histopathological evaluation, molecular analyses, as well as clinical and radiological information were considered and discussed by the interdisciplinary tumor board of the Charité Comprehensive Cancer Center. Figure 1A shows two example cases in which the board made a diagnosis based on conventional clinicopathological information. Although the tumors showed similar histomorphology, one case revealed concordant p16 and p53 expression in HNSC and the lung tumor and had multiple peripheral lung tumors, indicating metastatic disease. In the other case, discordant p16 expression was observed and the solitary lung tumor was located at the main bronchus, indicating a primary lung tumor. Despite thorough analysis and discussion by the board, 344 cases (84.3%) remained unsolved due to contradictory information. Consensus regarding the tumor’s origin was reached in only 64 cases (15.7%) (Fig. 1B). Thirty-eight tumors (59.4%) were classified as pulmonary metastases of HNSC, and a second LUSC was assumed in 26 cases (40.6%). For 54 of these 64 specimens, a sufficient amount of tumor material was available for further analysis. We excluded three cases from the dataset as the derived DNA methylation data did not pass quality control, resulting in a clinical cohort of 51 samples, the clinical and histopathological details of which are summarized in table S1 (in data file S1). As expected, patients who were diagnosed with HNSC metastases had a significantly shorter disease-specific survival than those with a second LUSC (P = 0.0016; Fig. 1C).

Fig. 1 Retrospective evaluation of patients with HNSC with a synchronous or metachronous squamous lung tumor.

(A) Two cases of HNSC patients with an additional squamous lung tumor. In case 21, a pulmonary metastasis of HNSC was diagnosed because of similar morphology on hematoxylin and eosin (H&E)–stained slides, concordant immunohistochemistry of p16 (negative) and p53 (mutated pattern), as well as the peripheral localization of the tumor (arrow). Case 33 was diagnosed as a second LUSC because of different p16 expression (HNSC positive, lung tumor negative) and the central localization of the tumor (arrow) with connection to a main bronchus. Scale bars, 50 μm. CT, computed tomography. (B) In our retrospective analysis, we identified 408 patients with a history of HNSC and a synchronous or metachronous lung tumor. After reviewing all available clinicopathological data and discussion within the tumor board, 26 (6.4%) were considered as second LUSC and 38 (9.3%) as HNSC metastases. (C) Kaplan-Meier plot of 51 cases that were solved on the basis of clinicopathological data and had enough tumor tissue available for further analysis.

Development and comparison of different machine learning classifiers on primary tumor samples

As a first step, we used primary HNSC and LUSC samples to identify characteristic epigenetic signatures of the respective tumor type. This reference cohort (n = 1071) consisted of primary HNSC, primary LUSC, and normal lung tissue samples from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases. By also including normal lung tissue, we provided a control mechanism to recognize samples that were contaminated by adjacent lung stroma. Otherwise, a pulmonary metastasis of HNSC with a low tumor cell content might have falsely been considered as a LUSC.

In a t-distributed stochastic neighbor embedding (t-SNE) analysis of this reference cohort, HNSC and LUSC samples formed two roughly different groups with considerable overlap and no clear separation (Fig. 2A). We observed two distinct groups for the normal lung tissue specimens, representing mixed lung tissue from the TCGA dataset and alveolar epithelial cells from a GEO dataset. Moreover, two LUSC samples fell into the normal lung tissue group. We found a more pronounced separation of LUSC versus HNSC in cases with higher tumor purity as assessed by the tumor purity estimation method “ESTIMATE” (21) based on gene expression data (Fig. 2B), as well as the DNA methylation–based “InfiniumPurify” (22) method (Fig. 2C). Additional methods to estimate tumor purity are shown in fig. S1. LUSC cases with low tumor cellularity resembled normal lung tissue, as expected. With regards to human papillomavirus (HPV) status, HPV-positive cases accumulated in a distinct subgroup (Fig. 2D), suggesting that these tumors were epigenetically different from HPV-negative specimens. The samples in the HPV subgroup also had a relatively low incidence of TP53 mutations (Fig. 2E). Among HNSC and LUSC, we found no subgroups that associated with shorter overall survival (Fig. 2F). Annotation of the HNSC site of origin revealed three relatively distinct subgroups representing tumor originating from the oropharynx, the oral cavity, and the larynx (Fig. 2G). Similar to t-SNE, hierarchical clustering revealed that normal lung samples were distinct from the tumor samples, but there was no clear separation of HNSC and LUSC (fig. S2).

Fig. 2 t-SNE plots of the training cohort, consisting of primary HNSC, LUSC, and normal lung tissue.

(A) General t-SNE plot showing the tissue type. (B and C) t-SNE plots showing the association with estimated tumor cell purity using the RNA-based “ESTIMATE” method (B) and the DNA methylation–based “InfiniumPurify” method (C). (D) t-SNE plot showing the human papillomavirus (HPV) status of the training cohort. (E) t-SNE plot visualizing the TP53 mutational status. (F) Annotation of survival time reveals no prognostic subgroups. For the survival analysis, patients with an event-free follow-up of less than 24 months were censored. (G) t-SNE plot showing the distribution of the different HNSC tumor origins.

For further downstream analysis, we selected the 2000 most variable CpG sites. In a gene enrichment analysis, 128 of 22,132 tested Gene Ontology (GO) categories were significantly enriched in the 2000 top variable CpG sites after multiple testing correction with the Benjamini-Hochberg (BH) method (table S2 in data file S1). The enriched categories included GO terms related to tissue differentiation such as “system development” (p-BH = 5.98 × 10−12), “embryonic morphogenesis” (p-BH = 9.43 × 10−8), and “cell differentiation” (p-BH = 0.001). GO terms representing transcriptional factor activity, such as “sequence-specific DNA binding,” were also significantly enriched (p-BH = 6.47 × 10−12).

Using these 2000 CpG sites, we developed three different machine learning classifiers based on artificial neural networks, support vector machines, and random forests. We trained these algorithms to classify tumor samples with respect to their organ of origin based on their DNA methylation profile and provide probability scores of the classification results. All three classifiers were tuned using fivefold cross-validation on the reference cohort. We then applied the resulting models to an independent validation cohort consisting of primary HNSC and LUSC tumor samples from different GEO datasets and our own studies, amounting to 279 samples in total (Fig. 3A). This publicly available validation dataset comprised DNA methylation data from diverse sources, including different material types, DNA methylation array chips, and preprocessing methods.

Fig. 3 Classification results of the machine learning algorithms on an independent validation cohort.

(A) Overview of the samples in the validation cohort (n = 279). (B to D) Threshold analysis for the prediction score of the neural network (B), the support vector machine (C), and the random forest (D). (E to G) Confusion matrices for the neural network (E), the support vector machine (F), and the random forest (G) comparing the true class with the predicted class for the HNSC (n = 110), LUSC (n = 150), and normal lung tissue (n = 19) samples. (H to J) Distribution of probability scores (range, 0 to 1) for the correct class for the neural network (H), the support vector machine (I), and the random forest (J) over different tissue types, datasets, material types [formalin-fixed paraffin-embedded (FFPE) versus frozen], DNA methylation chip arrays (450k and EPIC), and preprocessing (Preproc.) methods (Noob versus Illumina).

Classification accuracy

Assigning each sample to the category with the highest probability score, the artificial neural network and the support vector machine correctly classified 96.4 and 95.7% of all cases in the validation cohort, respectively. The random forest fell short of these results, achieving an accuracy rate of only 87.8%. The corresponding three-class areas under the curve (AUCs) were 0.9934, 0.9915, and 0.9708 for the artificial neural network, the support vector machine, and the random forest classifier, respectively. Further, the positive predictive values were 96.4% (HNSC) and 96.7% (LUSC) for the neural network, 98.1% (HNSC) and 94.8% (LUSC) for the support vector machine, and 80.6% (HNSC) and 94.6% (LUSC) for the random forest classifier. To explore whether higher prediction accuracies could be achieved for subsets of cases, we included the confidence of the predictions (probability scores) in our analysis and excluded samples with low scores, which were considered to be unclassifiable (“no match”). By increasing the threshold for the minimally required score, the accuracies could be increased to more than 99% for subsets of samples for all three algorithms. As the distribution of the probability scores was different for each of the three methods, we defined thresholds yielding an accuracy of 98 and 99% if the corresponding samples had a probability score above this threshold calculated by the neural network (Fig. 3B), the support vector machine (Fig. 3C), or the random forest (Fig. 3D). Using probability score cutoffs of 0.8 and 0.95, the accuracy rates of the artificial neural network increased to 98.1 and 99.2%, respectively, while 96.4 and 92.1% of the samples still had probability scores above the threshold (Fig. 3E). The remaining samples were considered not classifiable (no match). For the support vector machine, using thresholds of 0.7 and 0.85, accuracies of 98.1 and 99.2% were achieved for 95.0 and 90.0% of the samples, respectively (Fig. 3F). For the random forest algorithm, with thresholds 0.7 and 0.8, accuracies of 98.2 and 99.2% were reached for 60.9 and 43.0% of all samples, respectively (Fig. 3G). This shows that although high subset prediction accuracies could be achieved for all three methods, the artificial neural network approach resulted in high accuracy predictions for the largest number of cases.

Next, we analyzed the distribution of the probability scores across different tissue types, studies, material types [formalin-fixed paraffin-embedded (FFPE) versus frozen], DNA methylation chip arrays (EPIC versus 450k), and preprocessing methods for the neural network (Fig. 3H), the support vector machine (Fig. 3I), and the random forest (Fig. 3J). The normal lung tissue samples reached slightly lower scores than the tumor samples for all three algorithms. Besides that, the study, material type, DNA methylation chip array, and preprocessing method had no major influence on prediction accuracies for the artificial neural network and the support vector machine, demonstrating that these methods are robust to possible confounding factors. Although the classifiers had only been trained on frozen samples, the FFPE specimens were assigned to the correct class with even higher accuracy than the frozen samples. Unlike the artificial neural network and the support vector machine, random forest scores were influenced by the abovementioned factors, which might indicate that this method generalizes less well to data types that differ from those that have been used as a training set.

Validation in an independent clinical cohort

As the three different classifiers were developed and validated in this study on primary tumor samples, we aimed to further verify our results on an additional cohort resembling a clinical setting. As there is no gold standard to differentiate HNSC and LUSC, we selected an independent clinical cohort consisting of 51 suitable patients from the archives of the Charité Institute of Pathology (Fig. 4A). In a t-SNE plot including all cases from the reference cohort and the clinical cohort, the clinical validation cases tended to accumulate in the reference cohort groups that were expected based on their clinicopathological annotation (Fig. 4B).

Fig. 4 Classification results in the clinical cohort.

(A) Heat map of classification results for the neural network (NN), support vector machine (SVM), and random forest (RF) alongside with clinicopathological features that were considered to reach consent by the tumor board. (B) t-SNE analysis of the samples from the clinical cohort (n = 51) and the reference cohort (n = 1071) featuring LUSC (n = 354), HNSC (n = 528), and normal lung samples (n = 189).

Applying our classifiers to this clinical cohort yielded raw accuracies of 98.0, 96.1, and 84.3% and two-class AUCs (LUSC versus HNSC) of 1.0, 1.0, and 0.976 for the artificial neural network, support vector machine, and random forest, respectively (Fig. 4A). The positive predictive values were 100% (HNSC) and 95.5% (LUSC) for the neural network, 100% (HNSC) and 91.3% (LUSC) for the support vector machine, and 92.6% (HNSC) and 78.2% (LUSC) for the random forest classifier. All misclassified samples had comparably low probability scores (table S1 in data file S1). Considering those samples with prediction scores above the previously defined threshold, the accuracy, positive predictive value, and the AUC were increased to 100% (fig. S3). Again, the artificial neural network achieved the highest prediction accuracy for the largest number of samples. As expected, the classification results were also associated with disease-specific survival, except for the random forest classifier (fig. S4).

Furthermore, we applied our classifier to the four TCGA patients with LUSC that also had a history of HNSC. These samples were not part of the reference or the validation cohort, as we excluded patients with previous malignancies that could metastasize to the lung and represent or mimic squamous differentiation. Two tumors were classified as HNSC metastases with high prediction scores in all three machine learning methods (table S3 in data file S1), suggesting that these tumors might have been misdiagnosed. Our results are supported by the clinical follow-up data of these cases, as the two patients with a predicted HNSC metastasis had a relatively short overall survival despite having a relatively low clinical stage if the tumor was considered a primary LUSC (table S3 in data file S1). In addition, HPV16 was detected in one of these samples, representing the only HPV-positive tumor in the entire TCGA LUSC cohort (23), further indicating that this specimen is most likely a pulmonary metastasis of the previously diagnosed HNSC.


With this study, we successfully demonstrate that DNA methylation profiling in conjunction with machine learning solves the diagnostic problem of differentiating lung metastases of squamous cell carcinomas of the head and neck from primary lung cancers arising in patients with previous or simultaneous HNSC. In the clinical database of the Charité University Hospital Berlin, we identified a substantial number of patients with HNSC who had a synchronous or metachronous squamous lung tumor. In line with common clinical knowledge, we demonstrated that currently established diagnostic methods such as histomorphology, immunohistochemistry, and thorough clinical and radiological investigation failed to resolve this problem in the majority of cases.

Previous studies that tried to address this issue using mRNA expression or proteomics achieved accuracy rates between 82.9 and 86.8% (7, 16, 17). In our study, we were able to demonstrate the superiority of DNA methylation profiling in differentiating primary squamous cell carcinomas from the lung region and the head and neck region. We also compared different machine learning techniques to identify the most suitable method to interpret our DNA methylation data. Even our worst-performing model equaled the best accuracy rates of previous studies that were based on proteomics (7). Using an artificial neural network, we were able to correctly classify 96.4% of primary tumor samples from an independent and diverse validation cohort. The classification accuracy can be further improved for subsets of patients by using probability scores indicating the confidence of the results provided by the classifier.

This leads to the second major aspect of our study, in which we compared the performance of random forest, support vector machine, and neural network–based classifiers to interpret DNA methylation data. Different methods have previously been used to develop DNA methylation–based diagnostic classifiers, including random forests and prediction analysis of microarrays (PAM) classifiers (19, 20, 24). The most successful and commonly used classifiers have been based on random forests (19). In these previous studies, most t-SNE analyses revealed clearly separated groups for each entity or subgroup. However, in our study, unsupervised techniques such as t-SNE or hierarchical clustering failed to clearly separate HNSC and LUSC samples. The application of a random forests classifier on our dataset resulted in only mediocre overall accuracy. Furthermore, the use of tail probabilities led to the exclusion of more than 50% of the samples with random forests. Therefore, we also trained artificial neural network and support vector machine classifiers. Overall, the artificial neural network was slightly superior to the support vector machine and both methods largely excel the random forest classifier. In particular, the artificial neural network yielded better results across datasets from different sources, which might be due to batch effects. This advantage is of tremendous importance for application in a diagnostic setting, which requires robust results independent of potentially confounding factors such as interlaboratory variability or other batch effects. Moreover, neural networks can be pretrained on unlabeled data, which might, even more, increase the prediction accuracies in future studies.

A limitation of our work is the fact that DNA methylation analysis is not yet as widely deployed as other molecular methods (for example, NGS), restricting the immediate broad diagnostic utility of our classifier. However, several publications recently demonstrated the emerging importance of this method to correctly characterize different cancers such as brain tumors or sarcomas (19, 20). Therefore, a broader establishment of DNA methylation analysis is probably imminent. Another disadvantage of this technique is the relatively large amount of optimal DNA input (500 ng) recommended by the manufacturer. In theory, this could potentially limit the applicability of this approach with respect to small biopsies. However, it has been shown that DNA methylation analysis of small samples still delivers reliable results (25). In line with these reports, we successfully analyzed and classified 15 biopsy specimens with a DNA input as low as 110 ng. Another limitation of DNA methylation analysis is the potential interference with adjacent benign tissue. As evident from our t-SNE analysis, low tumor cellularity could potentially complicate the classification of the tumor origin by causing low probability scores. Tumor purity has already been described as a limiting factor for the classification in brain tumors, leading to a dropout rate of approximately 4% (19). However, tumor purity is routinely checked histologically and low probability scores or classification as normal tissue points the pathologist to problematic cases. Last, the system has so far not been trained for the classification of squamous cell carcinomas from other anatomic sites. To what extent our approach can be applied to identify other tumor origins will have to be analyzed in future studies.

Despite these limitations, the method described in this study delivers a so far unmatched accuracy in distinguishing pulmonary metastases of HNSC from primary LUSC. In contrast to DNA methylation–based classifiers for cancers of unknown primary, our approach is specifically tailored to differentiate HNSC from LUSC. By this means, we were able to achieve accuracy rates that could actually qualify this approach for use in routine diagnostics. Furthermore, by using a reference approach, we can avoid the disadvantages of currently established analyses, which rely on a comparative evaluation (for example, immunohistochemistry, HPV genotyping, and NGS) of both tumors. In contrast to our proposed reference-based single analysis, comparative analyses of the HNSC and the lung tumor are more costly and often tissue samples of the original HNSC tumor are not readily available.

We were also able to further validate our results in an independent clinical cohort of patients with a history of HNSC and a synchronous or metachronous squamous lung tumor who underwent diagnosis and treatment at our institution. We successfully tested our classification against the fact that metastatic disease is prognostically unfavorable compared to a second (primary) lung tumor.

A gene set enrichment analysis revealed that the 2000 most variant CpG sites that were chosen for the classifier training were significantly enriched for GO terms related to tissue differentiation. This suggests that specific epigenetic marks acquired during tissue differentiation might be particularly important for the DNA methylation–based classification.

In summary, our study presents a highly accurate and robust machine learning–based DNA methylomics approach capable of differentiating pulmonary metastases of HNSC from primary LUSC. In addition, our study demonstrated that, compared to random forests, support vector machines and particularly neural networks were superior in solving this diagnostic dilemma.


Study design

Our main research objective was to build a diagnostic classifier able to distinguish pulmonary metastases of HNSC from LUSC based on their DNA methylation profiles. DNA methylation array data were obtained from publicly availably sources and FFPE tissue from our archives. The experimental design was retrospective. We compared three different machine learning algorithms (artificial neural networks, support vector machines, and random forests) to identify the optimal method. All algorithms were trained on a reference cohort of primary tumors and normal lung tissue (n = 1071), with 528 HNSC, 354 LUSC, and 74 normal lung tissue samples obtained from TCGA as well as 115 normal tissue samples from the GEO. We then evaluated the classifiers on a validation cohort of 110 primary HNSC, 150 primary LUSC, and 19 normal lung tissue samples obtained from six GEO datasets as well as from the archives of the Institute of Pathology at the Charité University Hospital Berlin. We further validated our method in an independent clinical cohort of 51 patients from our archives with synchronous or metachronous squamous cell lung tumor whose diagnosis could reliably be determined based on clinicopathological and molecular characteristics. No information about the samples in the validation cohort or the clinical cohort was used at any point for the development of the classifiers.

Patients and samples

Cases with a history of HNSC and a synchronous or metachronous squamous lung tumor (n = 408) were identified using the electronic patient files and the electronic database of the Charité University Hospital Berlin. All cases were discussed by the interdisciplinary lung or head and neck tumor boards of the Charité Comprehensive Cancer Center, which consists of medical oncologists, radiation oncologists, surgeons, pathologists, and radiologists. For all cases, a variety of clinical, radiological, histopathological, and molecular data were discussed (Fig. 4A and table S1 in data file S1). Consent regarding the lung tumor’s origin was reached in 64 of 408 samples.

Matching FFPE tissue available for 54 cases was retrieved from the archives of the Institute of Pathology at the Charité University Hospital Berlin. All samples were reevaluated on the basis of hematoxylin and eosin (H&E) stains, and additional immunohistochemical investigations were performed if needed. Clinical data were extracted from the electronic patient files and the Berlin and Brandenburg Clinical Cancer Registry.


Immunohistochemical staining was performed on a VENTANA BenchMark XT automated slide stainer according to the manufacturer’s instructions. Antibodies, their manufacturers, as well as concentrations and scoring systems are listed in table S4.

DNA extraction

Representative tumor areas were identified using light microscopy. If necessary, macrodissection was performed to reach a tumor cell content of at least 70%. Semi-automated DNA extraction was performed according to the manufacturer’s instructions (Maxwell RSC FFPE Plus DNA Purification Kit, Custom, Promega). DNA quantities were measured using Qubit HS DNA assay (Thermo Fisher Scientific).

HPV genotyping

HPV genotyping was performed using the HPV Type 3.5 C LCD array (Chipron) according to the manufacturer’s instructions. Data on HPV genotyping in the TCGA cohort were derived from (23).

Next-generation sequencing

Ion AmpliSeq Library Kit 2.0 (Thermo Fisher Scientific) was used to perform library preparation with 10 ng of genomic DNA using the Ion AmpliSeq Cancer Hotspot Panel v2 (Thermo Fisher Scientific). The final library was quantified with the Ion Library Quantitation Kit (Thermo Fisher Scientific). Samples were multiplexed and amplified on Ion Spheres Particles with Ion 530 Kit-Chef and were sequenced using Ion 530 Chip (Thermo Fisher Scientific) with an adapted standard protocol using 330 flows (26).

Tumor purity estimation

Estimations for the tumor purity were based on gene expression profiles (“ESTIMATE”), somatic copy number data (“ABSOLUTE”), DNA methylation (“LUMP” and “InfiniumPurity”), (21, 22), visual quantification of H&E slides, as well as a consensus measurement considering all of the mentioned methods (“consensus measurement of purity estimations”) (21).

DNA methylation analysis

The Infinium HD FFPE DNA Restore Kit was used for DNA restoration of FFPE samples. DNA methylation analysis was performed using the Illumina Infinium MethylationEPIC BeadChip, according to protocols supplied by the manufacturer.

Reference cohort

Raw DNA methylation data (IDAT files) from 528 primary HNSC, 370 primary LUSC, and 74 normal lung tissue samples were obtained from TCGA (2729). LUSC cases with a documented previous or synchronous tumor that could potentially result in a pulmonary metastasis of squamous-like differentiation (for example, HNSC, squamous cell skin cancer, and melanoma) were excluded from the dataset (n = 16). Patients with previous malignancies that did not mimic squamous cell carcinomas (such as lymphoma) or that could not metastasize to the lung (for example, basal cell carcinoma) were not excluded. We completed the reference cohort with a dataset of alveolar epithelial cells from 115 patients (GSE85566) (30). Only samples from the reference cohort were used for tuning of the classification algorithms and definition of the most variable CpG sites.

Validation cohort

We tested the classification algorithms on DNA methylation data from 105 primary HNSC, 145 primary LUSC, and 19 normal lung tissue samples from additional six different GEO datasets (GSE56044, GSE79556, GSE95036, GSE66836, GSE39279, and GSE87053) (3136). For GSE39279, raw IDAT files were not available; therefore, we used the beta values preprocessed with the Illumina method. Furthermore, we also analyzed five primary HNSC and five primary LUSC samples from our own archives.

Clinical cohort

We identified 54 samples from patients with a history of HNSC and a synchronous or metachronous squamous cell lung tumor whose diagnosis could reliably be determined based on histopathological investigation, molecular analyses, and clinical and radiological data. Thirty-four cases were regarded as pulmonary metastases of HNSC and 20 as a second LUSC. These samples were analyzed using the Illumina Infinium MethylationEPIC BeadChip. Three samples were excluded after data generation because of low global quality scores (median detection P value > 0.01). Case 43 was initially suspected to be a pulmonary metastasis because of the detection of HPV16 in the primary HNSC as well as the lung tumor. However, there was a considerably long time interval between the diagnosis of both cancers (121 months) and a discrepancy regarding p53 immunohistochemistry, and the patient was still alive after more than 5 years of follow-up. To resolve this case, we performed comparative NGS. We discovered mutations in the FGFR3, RB1, and NOTCH1 gene only in the HNSC sample as well as a TP53 mutation that was exclusive to the lung tumor (table S5). There was no overlap of pathogenic mutations, thus further supporting the pulmonary origin of this tumor despite the presence of HPV16.

Classifier development

We trained artificial neural networks, a support vector machine, and a random forest classifier on the reference cohort and tuned the parameters using fivefold cross-validation. The validation cohort and the clinical cohort were not used at any point for variable selection or for model selection. To determine the minimal number of required CpG sites (Nsites), all models were trained on the top 100, 200, 500, 1000, 2000, 5000, 10,000, and 20,000 most variable CpG sites. Neural networks were built with the R package keras (37). The optimal number of hidden layers (search range: 1, 2, 3), neurons per hidden layer (64, 128, 256), the dropout ratio (0.1, 0.3, 0.5), and the nonlinearity (tanh, ReLU) were determined by fivefold cross-validation. The Adam optimizer (38) was used with default parameters and a cross-entropy loss. A support vector machine with Gaussian radial basis function (RBF) kernel was used as implemented in the R package kernlab (39). Optimal parameters for the variance of the RBF kernel sigma (1/Nsites × 2–5,…,5) and the amount of regularization C (20,…,5) were selected using fivefold cross-validation and cross-entropy loss. A random forest classifier with 2000 trees was used as implemented in the R package randomForest (40). The inclusion of more trees did not reduce cross-validation loss. The number of features at each split mtry was optimized with fivefold cross-validation (mtry = ⌈ √ (Nsites)⌉ × 2–5,…,5), with cross-entropy loss and the sample size sampsize set to (189, 189, 189). All other parameters were kept at the default values.

Selection of final classifier models

With increasing numbers of sites, the cross-validation loss reached a plateau at 2000 CpG sites for all three models (fig. S5). Hence, with 2000 CpG sites, the models had low cross-validation error and comparatively low complexity and were thus less prone to overfitting than more complex models including more CpG sites, in line with machine learning theory (41). Therefore, we selected the top 2000 variable CpG sites to train all three classification models.

An artificial neural network with three hidden layers of 64 units and a dropout ratio of 0.3 achieved the lowest cross-validation loss and was subsequently used. The final classifier was an ensemble obtained by averaging the output probabilities of the neural networks trained on the five different cross-validation folds. In our experiments, the ensemble had, on average, a higher accuracy than the individual models on both the validation and clinical cohort. Ensembles are known to reduce the classifier variance and thus the risk for overfitting; this makes them attractive to use specifically in domains where only moderate amounts of data are available (38, 41, 42). For support vector machines, the optimal parameters were sigma = 6.25 × 10−5 and C = 32. For the random forest classifier, the cross-validation resulted in mtry = 1408. Probability scores were computed for the prediction task. All fitted models are available in the Supplementary Materials.

Rejection class analysis

To increase classification accuracy, we defined a rejection class based on thresholds of minimally required probability scores. The probability scores computed by the classifier provide a measure of the confidence of the classifier results. As the output scores of the classifiers are not calibrated, the distribution of these probabilities in the validation cohort differed between the different algorithms; hence, it was not possible to define a unique probability threshold for all three methods. Therefore, for each threshold in 0.5, 0.55, …, 0.95, we computed the accuracy if we excluded samples with probability scores below the threshold (Fig. 3). This allowed defining method-specific thresholds, which resulted in a prediction accuracy of more than 98 and 99%, respectively, for all samples with scores above these thresholds. Specimens below the threshold were considered not classifiable (no match).

t-SNE, heat maps, receiver operating characteristics, and survival analysis

All t-SNE plots were performed using the R package Rtsne (43) with 2000 iterations and a perplexity of 30. Heat maps were created using the ComplexHeatmap (42) package using average linkage and Euclidian distance as similarity measure. Receiver operating characteristic curves (ROCs) and AUCs were computed with the R package pROC (44). Multi-class AUCs were computed as defined by Hand and Till (45).

Gene set enrichment analysis

We performed gene set enrichment analysis of the top 2000 variable CpG sites using the function gometh from the R-package missMethyl (46). Multiple testing correction was applied using the BH method (false discovery rate < 5%).

Statistical analysis

All data analyses were performed using the statistical programming language R (47), including the packages caret (48), kernlab (39), randomForest (40), keras (37), and Rtsne (43). DNA methylation data were processed using the minfi (49) package and single-sample normal-exponential out-of-band (Noob) normalization (50). As described previously (19), we excluded CpG sites associated with single-nucleotide polymorphisms or the sex chromosomes as well as sites that have previously been reported as cross-reactive (19, 51). Survival curves were generated using the Kaplan-Meier method and tested for significance using the log-rank test. P values of <0.05 were considered statistically significant.


Fig. S1. t-SNE analysis of additional methods to estimate tumor purity in the reference cohort.

Fig. S2. Heat map analysis of the top 2000 variable CpG sites from the reference cohort.

Fig. S3. Overview of the probability scores of the three machine learning algorithms on the clinical cohort.

Fig. S4. Prognostic relevance of the diagnosis predicted by the three different classifiers in the clinical cohort.

Fig. S5. Dependency of the cross-validation loss in the reference cohort on the number of CpG sites included in the classification model for the three different classification methods.

Data file S1 contains tables S1 to S3:

Table S1. Detailed metadata and probability scores for the samples included in the clinical cohort.

Table S2. Gene enrichment analysis of the 2000 top variable CpG sites of the reference cohort.

Table S3. Metadata and detailed classification results for LUSC cases from TCGA with previous HNSC.

Table S4. Antibodies used for immunohistochemistry.

Table S5. Results from comparative NGS of cases 31 and 43.

References and Notes

Acknowledgments: We thank J. Staaf for providing raw IDAT files from primary LUSC samples. We thank H. Bläker for providing excellent consultations during the histomorphological reevaluation. We gratefully acknowledge the excellent technical assistance of A. Förster, V. Arnemann, B. Meyer-Bartell, P. Wolkenstein, and P. Jank. The results shown here are, in part, based on data generated by the TCGA Research Network ( Funding: This work was supported by the European Fund for Regional Development (EFRE) and the Federal State of Berlin (1303/2013) (to P.J.), the German Federal Ministry of Education and Research (MALT3, BBDC, BZML) (to M.B., F.K., K.-R.M., and P.S.), the Fördergemeinschaft Kinderkrebszentrum Hamburg (to M.B. and U.S.), the Berlin Institute of Health (BIH) Charité Clinician Scientist Program funded by the Charité–Universitätsmedizin Berlin and the Berlin Institute of Health (to M.v.L.), and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Government of South Korea (2017-0-00451; to K.-R.M.). Author contributions: M.B., D.C., F.K., and P.J. designed the research goals and aims. The methodology was developed and designed by M.B., D.C., G.M., P.J., and P.S. The formal analysis was performed by M.B., P.J., P.S., and K.-R.M. M.B., P.J., D. Teichmann, D. Treue, and C.V. performed the investigations. Resources were provided by D.C., F.K., and K.-R.M. Data curation was performed by A.A., K.B., M.B., T.B., P.J., D. Teichmann, and C.V. M.B. and P.J. wrote the original draft of the paper. All authors reviewed and edited the manuscript. Visualizations were created by M.B. and P.J. D.C., F.K., and K.-R.M. supervised the project. Competing interests: D.C. is listed as inventor on the patent application “DNA-methylation based method for classifying tumor species” (PCT/EP2016/055337) filed by Deutsches Krebsforschungszentrum Stiftung des öffentlichen Rechts and Ruprecht-Karls-Universität Heidelberg. Data and materials availability: All data associated with this paper can be found in the main text or the Supplementary Materials. IDAT files of the samples analyzed in this paper have been deposited in GEO (GSE124052). The code and the data necessary to reproduce the main analyses from this paper are available at

Stay Connected to Science Translational Medicine

Navigate This Article