Research Article: Infectious Disease

Host gene expression classifiers diagnose acute respiratory illness etiology


Science Translational Medicine  20 Jan 2016:
Vol. 8, Issue 322, pp. 322ra11
DOI: 10.1126/scitranslmed.aad6873

Resisting antibiotics

No matter the cause, acute respiratory infections can be miserable. Indeed, these infections are one of the most common reasons for seeking medical care. A clear diagnostic can help medical practitioners resist the patient-induced pressure to prescribe antibiotics as a catch-all therapy, which increases the risk of bacteria developing antibiotic resistance. Now, Tsalik et al. report clear differences in host gene expression induced by bacterial and viral infection as well as by noninfectious illness. These differences can be used to discriminate between these groups, and a host gene expression classifier may be a helpful diagnostic platform to curb unnecessary antibiotic use.


Acute respiratory infections caused by bacterial or viral pathogens are among the most common reasons for seeking medical care. Despite improvements in pathogen-based diagnostics, most patients receive inappropriate antibiotics. Host response biomarkers offer an alternative diagnostic approach to direct antimicrobial use. This observational cohort study determined whether host gene expression patterns discriminate noninfectious from infectious illness and bacterial from viral causes of acute respiratory infection in the acute care setting. Peripheral whole blood gene expression from 273 subjects with community-onset acute respiratory infection (ARI) or noninfectious illness, as well as 44 healthy controls, was measured using microarrays. Sparse logistic regression was used to develop classifiers for bacterial ARI (71 probes), viral ARI (33 probes), or a noninfectious cause of illness (26 probes). Overall accuracy was 87% (238 of 273 concordant with clinical adjudication), which was more accurate than procalcitonin (78%, P < 0.03) and three published classifiers of bacterial versus viral infection (78 to 83%). The classifiers developed here externally validated in five publicly available data sets (AUC, 0.90 to 0.99). A sixth publicly available data set included 25 patients with co-identification of bacterial and viral pathogens. Applying the ARI classifiers defined four distinct groups: a host response to bacterial ARI, viral ARI, coinfection, and neither a bacterial nor a viral response. These findings create an opportunity to develop and use host gene expression classifiers as diagnostic platforms to combat inappropriate antibiotic use and emerging antibiotic resistance.


Respiratory tract infections caused 3.2 million deaths worldwide and 164 million disability-adjusted life years lost in 2011, more than any other cause (1). Despite a viral etiology in most cases, 73% of ambulatory care patients in the United States with acute respiratory infection (ARI) are prescribed an antibiotic, accounting for 41% of all antibiotics prescribed in this setting (2, 3). Even when a viral pathogen is microbiologically confirmed, this does not exclude a possible concurrent bacterial infection, leading to antimicrobial prescribing “just in case.” This empiricism drives antimicrobial resistance (4, 5), recognized as a national security priority (6).

The host’s peripheral blood gene expression response to infection offers a diagnostic strategy complementary to those already in use. This strategy has successfully characterized the host response to viral (7–12) and bacterial ARI (10, 13). Despite these advances, several issues preclude the use of these signatures as diagnostics in patient care settings. An important consideration in the development of host-based molecular signatures is that they be developed in the intended use population (14). However, nearly all published gene expression–based ARI classifiers used healthy individuals as controls and focused on small or homogeneous populations. Furthermore, the statistical methods used to identify gene expression classifiers often include redundant genes based on clustering, univariate testing, or pathway association. These strategies identify relevant biology but do not maximize diagnostic performance. An alternative is to combine genes from potentially unrelated pathways to generate a more informative classifier.

We present evidence from a large observational cohort of Emergency Department patients that host responses to bacterial, viral, or noninfectious insults are unique and quantifiable. Therefore, the objective of this study is to show that the host response, as measured by peripheral blood gene expression changes, can accurately differentiate viral ARI, bacterial ARI, and noninfectious illness as an important step toward their routine use in clinical practice. Such an approach offers new opportunities to guide appropriate antibiotic use and combat emerging antibiotic resistance.


Bacterial ARI, viral ARI, and noninfectious illness classifiers

In generating host gene expression–based classifiers that distinguish between clinical states, all relevant clinical phenotypes should be represented during the model training process. This imparts specificity, allowing the model to be applied to these included clinical groups but not to clinical phenotypes that were absent from model training (14). The target population for an ARI diagnostic includes not only patients with viral and bacterial etiologies but also patients without bacterial or viral ARI, who must be distinguished from the first two groups. Historically, healthy individuals have served as the uninfected control group. However, this fails to consider how patients with noninfectious illness, which can present with similar clinical symptoms, would be classified, serving as a potential source of diagnostic error. To our knowledge, no ARI gene expression–based classifier has included ill, uninfected controls in its derivation. We therefore enrolled a large, heterogeneous population of patients at initial clinical presentation with community-onset viral ARI (n = 115), bacterial ARI (n = 70), or noninfectious illness (n = 88) (Table 1 and table S1). We also included a healthy adult control cohort (n = 44) to define the most appropriate control population for ARI classifier development. Clinical features of the subjects are summarized in table S2.

Table 1. Demographic information for the enrolled cohort as well as independent data sets used for external validation.

M, male; F, female; B, black; W, white; O, other/unknown. GSE numbers refer to National Center for Biotechnology Information (NCBI) Gene Expression Omnibus data sets. N/A, not available on the basis of published data.


We first determined whether a gene expression classifier derived with healthy individuals as controls could accurately classify patients with noninfectious illness. Array data from patients with bacterial ARI or viral ARI and from healthy controls were used to generate gene expression classifiers for these conditions (Fig. 1). Leave-one-out cross-validation revealed highly accurate discrimination between bacterial ARI [area under the receiver operating characteristic curve (AUC), 0.96], viral ARI (AUC, 0.95), and healthy (AUC, 1.0) subjects for a combined accuracy of 90% (Fig. 2). However, when the classifier was applied to ill but uninfected patients, 48 of 88 were identified as bacterial, 35 of 88 as viral, and 5 of 88 as healthy. This highlighted that healthy individuals are a poor substitute for patients with noninfectious illness in the biomarker discovery process.

Fig. 1. Experimental flow.

A cohort of patients encompassing bacterial ARI, viral ARI, or noninfectious illness was used to develop classifiers of each condition. This combined ARI classifier was validated using leave-one-out cross-validation and compared to three published classifiers of bacterial versus viral infection. The combined ARI classifier was also externally validated in six publicly available data sets. In one experiment, healthy volunteers were included in the training set to determine their suitability as “no-infection” controls. All subsequent experiments were performed without the use of this healthy subject cohort.

Fig. 2. Evaluation of healthy adults as a no-infection control.

Classifiers of bacterial ARI, viral ARI, and no infection as represented by healthy controls were generated and applied using leave-one-out cross-validation. Each patient, represented along the horizontal axis, is assigned three distinct probabilities: bacterial ARI (black triangle), viral ARI (blue circle), and no infection (green square). The group on the right represents subjects with noninfectious illness.

Consequently, we rederived an ARI classifier using a noninfectious illness control rather than a healthy control. Specifically, array data from these three groups were used to generate three gene expression classifiers of host response to bacterial ARI, viral ARI, and noninfectious illness. The bacterial ARI classifier was tasked with positively identifying those with bacterial ARI versus either viral ARI or noninfectious illness. The viral ARI classifier was tasked with positively identifying those with viral ARI versus bacterial ARI or noninfectious illness. The noninfectious illness classifier was not generated with the intention of positively identifying all noninfectious illnesses, which would require an adequate representation of all such cases. Rather, it was generated as an alternative category, so that patients without bacterial or viral ARI could be assigned accordingly. Moreover, we hypothesized that such ill but uninfected patients were more clinically relevant controls because healthy people are unlikely to be the target for such a classification task.

Six statistical strategies were used to generate these gene expression classifiers: linear support vector machines, supervised factor models, sparse multinomial logistic regression, elastic nets, k-nearest neighbor, and random forests. All performed similarly, although sparse logistic regression required the fewest classifier genes and outperformed the other strategies by a small but nonsignificant margin (P > 0.05 using McNemar’s tests between leave-one-out cross-validated predictions from sparse logistic regression versus each alternative method). We also compared a strategy that generated three separate binary classifiers to a single multinomial classifier that would simultaneously assign a given subject to one of the three clinical categories. This latter approach required more genes and achieved inferior accuracy. Consequently, we applied a sparse logistic regression model to define bacterial ARI, viral ARI, and noninfectious illness classifiers containing 71, 33, and 26 probe signatures, respectively. Probe and classifier weights are shown in table S3. Clinical decision-making is infrequently binary, often requiring the simultaneous distinction of multiple diagnostic possibilities. We applied all three classifiers, collectively defined as the ARI classifier, using leave-one-out cross-validation to assign probabilities of bacterial ARI, viral ARI, and noninfectious illness (Fig. 3). These conditions are not mutually exclusive. For example, the presence of a bacterial ARI does not preclude a concurrent viral ARI or noninfectious disease. Moreover, the assigned probability represents the extent to which the patient’s gene expression response matches that condition’s canonical signature. Because each signature intentionally functions independently of the others, the probabilities are not expected to sum to 1. To simplify classification, the highest predicted probability determined class assignment.
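The modeling strategy described above (three independent one-vs.-rest sparse logistic regression classifiers, evaluated by leave-one-out cross-validation, with the highest probability deciding the class) can be sketched with scikit-learn. This is a minimal illustration on synthetic data standing in for the expression matrix; the gene counts, regularization strength, and class separation are our assumptions, not the authors’ pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in for the microarray matrix: 3 classes x 30 subjects, 50 probes.
rng = np.random.default_rng(0)
n_per, n_genes = 30, 50
X = rng.normal(size=(3 * n_per, n_genes))
y = np.repeat([0, 1, 2], n_per)          # 0 = bacterial, 1 = viral, 2 = noninfectious
for c in range(3):
    X[y == c, c * 5:(c + 1) * 5] += 2.0  # give each class its own informative probes

def one_vs_rest_probs(X_tr, y_tr, x_te):
    """Fit three sparse (L1-penalized) logistic models, one per class vs. rest,
    and return each class's independent probability for one held-out sample.
    Because the three models are independent, probabilities need not sum to 1."""
    probs = []
    for c in (0, 1, 2):
        clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
        clf.fit(X_tr, (y_tr == c).astype(int))
        probs.append(clf.predict_proba(x_te.reshape(1, -1))[0, 1])
    return np.array(probs)

preds = []
for tr, te in LeaveOneOut().split(X):     # hold out one subject at a time
    p = one_vs_rest_probs(X[tr], y[tr], X[te][0])
    preds.append(p.argmax())              # highest predicted probability wins
acc = float(np.mean(np.array(preds) == y))
```

The L1 penalty is what keeps the signatures sparse: probes that add no discriminative value beyond those already selected receive zero weight.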
Overall classification accuracy was 87% (238 of 273 concordant with the adjudicated phenotype). Bacterial ARI was identified in 83% (58 of 70) of patients and excluded in 94% (179 of 191) of patients without bacterial infection. Viral ARI was identified in 90% (104 of 115) and excluded in 92% (145 of 158) of cases. Using the noninfectious illness classifier, infection was excluded in 86% (76 of 88) of cases. Sensitivity analyses were performed for positive and negative predictive values for all three classifiers, given that prevalence can vary for numerous reasons including infection type, patient characteristics, and location (fig. S1).
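Because positive and negative predictive values shift with prevalence, they can be recomputed for any assumed prevalence from the cross-validated sensitivity and specificity via Bayes’ rule. A minimal sketch, using the bacterial ARI classifier counts reported above (58 of 70 identified; 179 of 191 excluded); the prevalence values looped over are illustrative:

```python
def predictive_values(sens, spec, prevalence):
    """Positive and negative predictive values via Bayes' rule for a given
    disease prevalence (the quantity varied in the sensitivity analysis)."""
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

# Bacterial ARI classifier: sensitivity 58/70, specificity 179/191 (from the text).
sens, spec = 58 / 70, 179 / 191
table = {prev: predictive_values(sens, spec, prev) for prev in (0.1, 0.3, 0.5)}
```

At low prevalence the negative predictive value stays high, which is the property most relevant to safely withholding antibiotics.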

Fig. 3. Leave-one-out cross-validation.

Classifiers of bacterial ARI, viral ARI, and no infection as represented by noninfectious illness were generated and applied using leave-one-out cross-validation. Each patient, represented along the horizontal axis, is assigned probabilities of having bacterial ARI (black triangle), viral ARI (blue circle), and noninfectious illness (red square). Patients clinically adjudicated as having bacterial ARI, viral ARI, or noninfectious illness are presented in the top, center, and bottom panels, respectively.

To determine whether there was any effect of age, we included it as a covariate in the classification scheme. This resulted in two additional correct classifications, likely due to the overrepresentation of young people in the viral ARI cohort. However, we observed no statistically significant age difference between correctly and incorrectly classified subjects (Wilcoxon rank sum, P = 0.17). Likewise, patients with viral ARI tended to be less ill, as demonstrated by a lower rate of hospitalization. We therefore used hospitalization as a marker of disease severity and assessed its effect on classification performance, which revealed no statistical difference (Fisher’s exact test, P = 1). As previously noted, the control cohort with SIRS included subjects with both respiratory and nonrespiratory etiologies. We assessed whether classification differed statistically between subjects with respiratory versus nonrespiratory SIRS and determined that it did not (Fisher’s exact test, P = 0.1305). Among the 47 subjects with respiratory SIRS, three were classified as having viral ARI and six as having bacterial ARI. Among the 41 subjects with nonrespiratory SIRS, one was classified as having viral ARI and two as having bacterial ARI.

We compared this performance to procalcitonin, a widely used biomarker with some specificity for bacterial infection (15). Procalcitonin concentrations were determined for the 238 subjects for whom samples were available and compared to ARI classifier performance in this subgroup. Procalcitonin concentrations >0.25 μg/liter assigned patients as having bacterial ARI, whereas values ≤0.25 μg/liter assigned patients as nonbacterial, which could be either viral ARI or noninfectious illness. Procalcitonin correctly classified 186 of 238 patients (78%), compared to 204 of 238 (86%) using the ARI classifier (P = 0.03 by McNemar’s test). However, the accuracy of the two strategies varied depending on the classification task. For example, performance was similar in discriminating viral from bacterial ARI: procalcitonin correctly classified 136 of 155 (AUC, 0.89), compared to 140 of 155 for the ARI classifier (P = 0.65 by McNemar’s test). However, the ARI classifier was significantly better than procalcitonin in discriminating bacterial ARI from noninfectious illness [105 of 124 versus 79 of 124 (AUC, 0.72); P < 0.001] and in discriminating bacterial ARI from all other etiologies, including viral and noninfectious etiologies [215 of 238 versus 186 of 238 (AUC, 0.82); P = 0.02 by McNemar’s test].
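The paired comparisons above rely on McNemar’s test, which considers only the discordant pairs, i.e., patients whom exactly one of the two tests classified correctly. A minimal sketch of the exact (binomial) form; the discordant counts in the test below are hypothetical, since the text reports only overall accuracies:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar test.
    b = pairs where only test A was correct, c = pairs where only test B was
    correct; concordant pairs do not enter the statistic. Returns a two-sided
    P value under the null that discordance is equally likely in each direction."""
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

For large discordant counts the chi-square approximation is typically substituted, but the exact form above is valid at any sample size.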

We next compared the ARI classifier to three published gene expression classifiers of bacterial versus viral infection, each of which was derived without uninfected ill controls. These included a 35-probe classifier (Ramilo) derived from children with influenza or bacterial sepsis (10), a 33-probe classifier (Hu) derived from children with febrile viral illness or bacterial infection (13), and a 29-probe classifier (Parnell) derived from adult intensive care unit (ICU) patients with community-acquired pneumonia or influenza (11). We hypothesized that classifiers generated using only patients with viral or bacterial infection would perform poorly when applied to a clinically relevant population that included ill but uninfected patients. Specifically, when presented with an individual with neither a bacterial nor a viral infection, the previously published classifiers would be unable to accurately assign that individual to a third, alternative category. We therefore applied the derived as well as published classifiers to our 273-patient cohort. Discrimination between bacterial ARI, viral ARI, and noninfectious illness was better with the derived ARI classifier (McNemar’s test, P = 0.002 versus Ramilo; P = 0.0001 versus Parnell; and P = 0.08 versus Hu) (Table 2) (16, 17). This underscores the importance of deriving gene expression classifiers in a cohort representative of the intended use population, which in the case of ARI should include noninfectious illness (14).

Table 2. Performance characteristics of the derived ARI classifier.

The bacterial ARI, viral ARI, and noninfectious illness classifiers were validated in combination using leave-one-out cross-validation in a population of bacterial ARI (n = 70), viral ARI (n = 115), or noninfectious illness (n = 88). Three published bacterial versus viral classifiers were identified and applied to the same population as comparators. Data are presented as number (%).


Discordant classifications

To better understand ARI classifier performance, we individually reviewed the 35 discordant cases (table S4). Nine adjudicated bacterial infections were classified as viral and three as noninfectious illness. Four viral infections were classified as bacterial and seven as noninfectious. Eight noninfectious cases were classified as bacterial and four as viral. We did not observe a consistent pattern among discordant cases. However, notable examples included atypical bacterial infections. One patient with Mycoplasma pneumoniae infection based on serological conversion and one of three patients with Legionella pneumonia were classified as viral ARI. Of six patients with noninfectious illness due to autoimmune or inflammatory diseases, only one adjudicated as Still’s disease was classified as having bacterial infection.

External validation

Generating classifiers from high-dimensional gene expression data can result in overfitting. We therefore validated the ARI classifier in silico using gene expression data from 328 individuals, represented in five available data sets (GSE6269, GSE42026, GSE40396, GSE20346, and GSE42834). These were chosen because they included at least two relevant clinical groups, varying in age, geographic distribution, and illness severity (Table 3). Applying the ARI classifier to four data sets with bacterial and viral ARI, AUC ranged from 0.90 to 0.99 (figs. S2 to S5). Lastly, GSE42834 included patients with bacterial pneumonia (n = 19), lung cancer (n = 16), and sarcoidosis (n = 68). Overall classification accuracy was 96% (99 of 103) corresponding to an AUC of 0.99 (fig. S6). GSE42834 included five subjects with bacterial pneumonia before and after treatment. All demonstrated a treatment-dependent resolution of the bacterial response signature (fig. S7).
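The validation AUCs above can be computed directly from predicted probabilities without constructing an ROC curve, using the rank-based (Mann-Whitney) formulation. A minimal, dependency-free sketch; the toy scores in the test are illustrative:

```python
def auc_rank(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive case
    receives a higher classifier score than a randomly chosen negative case
    (ties count as 0.5). Equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in cohort size, which is immaterial at the scale of these validation sets (tens to hundreds of subjects).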

Table 3. External validation of the ARI classifier (combined bacterial ARI, viral ARI, and noninfectious classifiers).

Five Gene Expression Omnibus data sets were identified on the basis of the inclusion of at least two of the relevant clinical groups: viral ARI, bacterial ARI, and noninfectious illness (SIRS).


A subgroup of patients with ARI will have both bacterial and viral pathogens identified, often termed coinfection. However, it is unclear how the host responds in such situations. Illness may be driven by the bacteria, the virus, both, or neither at different times in the patient’s clinical course. In an exploratory analysis to determine whether coinfection could be identified with these methods, we applied the bacterial and viral ARI classifiers to patients with bacterial and viral co-identification. GSE60244 included bacterial pneumonia (n = 22), viral respiratory tract infection (n = 71), and bacterial/viral co-identification (n = 25). The co-identification group was defined by the presence of both bacterial and viral pathogens without further information about the likelihood of bacterial or viral disease (18). We trained the ARI signatures in GSE60244 subjects with bacterial or viral infection and then validated in those with co-identification (Fig. 4). The host response signature was deemed positive above a probability threshold of 0.5. We observed all four possible categories. Six of 25 subjects had a positive bacterial signature, 14 of 25 had a viral response, 3 of 25 had positive bacterial and viral signatures, and 2 of 25 had neither. These results suggest that coinfection can be detected using the host response. Moreover, simply identifying bacterial and viral pathogens may not necessarily mean both are inducing a host response.
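Because the bacterial and viral classifiers operate independently, the four-way grouping described above reduces to two thresholded calls per patient. A minimal sketch; the 0.5 cutoff follows the text, while the function and category names are ours:

```python
def host_response_category(p_bacterial, p_viral, threshold=0.5):
    """Map a patient's two independent classifier probabilities to one of four
    host-response categories; a signature is 'positive' at probability >= 0.5."""
    bacterial = p_bacterial >= threshold
    viral = p_viral >= threshold
    if bacterial and viral:
        return "bacterial+viral"   # both host responses present (3 of 25 subjects)
    if bacterial:
        return "bacterial"         # bacterial response only (6 of 25)
    if viral:
        return "viral"             # viral response only (14 of 25)
    return "neither"               # no bacterial or viral response (2 of 25)
```

This makes explicit why pathogen co-identification need not imply coinfection: a patient can carry both pathogen types yet mount only one (or neither) host response.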

Fig. 4. Classifier performance in patients with coinfection defined by the identification of bacterial and viral pathogens.

Bacterial and viral ARI classifiers were trained on subjects with bacterial (n = 22) or viral (n = 71) infection (GSE60244). This same data set also included 25 subjects with bacterial/viral coinfection. Bacterial and viral classifier predictions were normalized to the same scale. Each subject receives two probabilities: that of a bacterial ARI host response and a viral ARI host response. A probability score of 0.5 or greater was considered positive. Subjects 1 to 6 have a bacterial host response. Subjects 7 to 9 have both bacterial and viral host responses. Subjects 10 to 23 have a viral host response. Subjects 24 to 25 do not have bacterial or viral host responses.

Biological pathways

The sparse logistic regression model that generated the classifiers penalizes selection of redundant (correlated) genes (for example, if from the same pathway) if there is no additive diagnostic value. Consequently, conventional gene enrichment pathway analysis is not appropriate to perform. Moreover, such conventional gene enrichment analyses have been described (8, 11, 13, 19, 20). Instead, a literature review was performed for all classifier genes (table S5). Overlap between bacterial, viral, and noninfectious illness classifiers is shown in fig. S8.

The viral classifier included known antiviral response categories such as interferon response, T cell signaling, and RNA processing. The viral classifier had the greatest representation of RNA processing pathways such as KPNB1, which is involved in nuclear transport and is co-opted by viruses for transport of viral proteins and genomes (21, 22). Its down-regulation suggests that it may play an antiviral role in the host response.

The bacterial classifier encompassed the greatest breadth of cellular processes, notably cell cycle regulation, cell growth, and differentiation. The bacterial classifier included genes important in T, B, and natural killer cell signaling. Unique to the bacterial classifier were genes involved in oxidative stress and fatty acid and amino acid metabolism, consistent with sepsis-related metabolic perturbations (23).


Acute respiratory illness accounted for 71 million outpatient visits to U.S. providers in 2007 (24). Existing diagnostics fall short in their ability to differentiate bacterial, viral, and noninfectious etiologies, contributing to the inappropriate prescription of antibiotics in 73% of such cases (3). Created by President Obama in 2014, the Task Force for Combating Antibiotic-Resistant Bacteria has prioritized the development of new and next-generation diagnostics (6). One strategy to accurately define the infecting pathogen class is to use host gene expression profiles. Using sparse logistic regression, we developed host gene expression profiles that accurately distinguished between bacterial and viral etiologies in patients with acute respiratory symptoms (external validation AUC, 0.90 to 0.99). Deriving the ARI classifier with a noninfectious illness control group imparted a high negative predictive value across a wide range of prevalence estimates. These encouraging metrics offer an opportunity to provide clinically actionable results, which can mitigate emerging antibiotic resistance.

Several studies made notable inroads in developing host response diagnostics for ARI. This includes the response to respiratory viruses (7, 9–11, 13), bacterial etiologies in an ICU population (11, 25), and tuberculosis (26–28). Many such studies define host response profiles compared to the healthy state, offering valuable insights into host biology (29–31). However, these gene lists are suboptimal diagnostic targets because gene expression profiles should ideally be applied to populations similar to those from which they were derived (14). Because healthy individuals do not present with acute respiratory complaints, they should be excluded from host response diagnostic development.

Including patients with bacterial and viral infections (10, 11, 13) allows for the distinction between these two states but does not address how to classify noninfectious illness. This phenotype is important to include because patients present in an undifferentiated manner whereby infectious and noninfectious etiologies are possible. This was the rationale for our approach, which was derived from, and can therefore be applied to, an undifferentiated clinical population where such a test is in greatest need. The cohort used to generate this classifier was derived from the larger Community Acquired Pneumonia and Sepsis Outcome Diagnostic (CAPSOD) cohort, which includes patients with suspected sepsis of nonrespiratory etiology as well. However, we only focused on patients with sepsis due to respiratory tract infection, and therefore, we cannot assume that these results would apply to a more general sepsis population.

Here, we report three discrete host response classifiers: bacterial ARI, viral ARI, and noninfectious disease. However, the major clinical decision faced by clinicians is whether or not to prescribe antibacterials. A simpler diagnostic strategy might focus only on the probability of bacterial ARI. However, there is value in providing information about viral or noninfectious alternatives. For example, the confidence to withhold antibacterials in a patient with a low probability of bacterial ARI can be enhanced by a high probability of an alternative diagnosis. Second, a full diagnostic report could identify concurrent illness that a single classifier would miss. We observed this when validating in a population with bacterial/viral co-identification. Such patients are more commonly referred to as “coinfected.” To have infection, there must be a pathogen, a host, and a clinically impactful interaction between the two. Simply identifying bacterial and viral pathogens should not imply coinfection. Although we cannot know the true infection status in these 25 subjects with bacterial/viral co-identification, our host response classifiers suggest the existence of multiple host response states.

Discordant classifications may have arisen from errors in classification or clinical phenotyping. Errors in clinical phenotyping can arise from a failure to identify causative pathogens due to limitations in current microbiological diagnostics. Alternatively, some noninfectious disease processes may in fact be infection-related through mechanisms that have yet to be discovered. Discordant cases were not clearly explained by a unifying variable such as pathogen type, syndrome, disease severity, or patient characteristic. As such, the gene expression classifiers presented herein are likely affected by other factors including patient-specific variables (for example, treatment, comorbidity, and duration of illness), test-specific variables (for example, sample processing, assay conditions, and RNA quality and yield), or as of yet unidentified variables. These concerns are heightened when validating in publicly available data sets where little to no information is made available about how such clinical labels are assigned. In the absence of phenotyping standards, errors in clinical diagnosis will propagate into poor performance of any classifier.

This study is limited in its ability to generalize to special populations such as neonates, patients with chronic viral infections, and the severely immunocompromised. Some of these patients were included in our cohort, but their numbers were too small to support definitive conclusions about classifier performance. In five patients (32), the host response to bacterial infection resolved with treatment. However, a larger cohort is needed to determine whether ARI classifier kinetics can be used for treatment response monitoring. Moreover, the magnitude of gene expression changes may offer prognostic utility. Although we found no statistically significant difference in classification performance when comparing respiratory to nonrespiratory SIRS, it is possible that a true difference exists that we were underpowered to detect. We have undertaken a large, prospective collection of patients with acute respiratory complaints to directly address these limitations (supported by the Antibacterial Resistance Leadership Group, National Institute of Allergy and Infectious Diseases UM1AI104681).

These results define the necessary content to improve ARI diagnostics in a clinically relevant population. However, the technical hurdle to transfer these targets to a reliable, timely, affordable, and accessible platform remains. Doing so will directly answer the call for new diagnostics to combat antibiotic-resistant bacteria, a national security and public health priority.


Study design

Studies were approved by the relevant Institutional Review Boards and conducted in accordance with the Declaration of Helsinki. All subjects or their legally authorized representatives provided written informed consent.

Patients with community-onset, suspected infection were enrolled in the Emergency Departments of Duke University Medical Center (Durham, NC), the Durham VA Medical Center (Durham, NC), or Henry Ford Hospital (Detroit, MI) as part of the CAPSOD study (NCT00258869) (23, 29, 33–35). Additional patients were enrolled through the UNC Health Care Emergency Department (UNC; Chapel Hill, NC) as part of the Community Acquired Pneumonia and Sepsis Study. Patients were eligible if they had a known or suspected infection and if they exhibited two or more SIRS criteria (36). The objective of these prospective observational studies was to identify patients with suspected sepsis and to collect clinical information and bank samples for future research use. Upon adjudication and subject selection (described below), banked samples were accessed and analyzed. ARI cases included patients with upper or lower respiratory tract symptoms, as adjudicated by emergency medicine (S.W.G. and E.B.Q.) or infectious diseases (E.L.T.) physicians. There are currently no accepted consensus criteria by which viral ARI or bacterial ARI can be defined. Here, we performed retrospective adjudications based on manual chart reviews performed at least 28 days after enrollment and before any gene expression–based categorization, as described by Langley et al. and in the text below (23). Medical record information used to support adjudications included, but was not limited to, patient symptoms, physical examination findings, routine laboratory testing, and radiographic findings (when clinically indicated). To be categorized as having a viral or bacterial ARI, a subject must have had a compatible clinical syndrome and an identified, compatible pathogen. Seventy patients with microbiologically confirmed bacterial ARI were identified, including 4 with pharyngitis and 66 with pneumonia.
Bacterial pharyngitis was adjudicated on the basis of patient-reported symptoms and examination findings such as tonsillar exudate or swelling, tender adenopathy, fever, and absence of cough, along with the identification of group A Streptococcus by either antigen detection or culture. Bacterial pneumonia was adjudicated on the basis of patient-reported symptoms and clinical evaluation such as productive cough, fever, leukocytosis/leukopenia, and typical radiographic infiltrates (for example, consolidation), along with the identification of bacterial pathogens known to cause pneumonia. Microbiological etiologies were determined using conventional culture of blood or respiratory samples, urinary antigen testing (Streptococcus or Legionella), or serological testing (Mycoplasma). There were 115 patients with viral ARI, including 48 students at Duke University enrolled through the Defense Advanced Research Projects Agency (DARPA) Predicting Health and Disease study. Viral ARI was adjudicated on the basis of patient-reported symptoms such as upper respiratory complaints (for example, rhinorrhea, sneezing, postnasal drip, and sore throat), epidemiologic factors such as sick contacts, and clinical evaluation such as the absence of radiological findings typical of bacterial infection, in conjunction with an identified viral etiology compatible with the clinical syndrome. Viral etiology testing was frequently performed as part of routine clinical care; specimens were typically nasopharyngeal swabs or lower respiratory tract samples. In addition, the ResPlex II version 2.0 viral polymerase chain reaction multiplex assay (Qiagen) augmented clinical testing for viral etiology identification. This panel detects influenza A and B, adenovirus B and E, parainfluenza 1 to 4, respiratory syncytial virus A and B, human metapneumovirus, human rhinovirus, coronavirus (229E, OC43, NL63, and HKU1), coxsackie/echo virus, and bocavirus.
Upon adjudication, a subset of enrolled patients was determined to have noninfectious illness (n = 88) (table S1). The determination of “noninfectious illness” was made only when an alternative diagnosis was established and results of any routinely ordered microbiological testing failed to support an infectious etiology. Inflammatory markers were not routinely measured for clinical purposes, although we did measure procalcitonin concentrations for study purposes. However, because classification performance was compared to procalcitonin, this biomarker was intentionally excluded from the adjudication process. Through this adjudication process, subjects were assigned to one of five likelihoods of infection (23, 33): (i) definite infection with an identified etiologic agent; (ii) definite infection without an identified etiologic agent; (iii) indeterminate, infection possible; (iv) no evidence of infection without an identified noninfectious etiology; and (v) no evidence of infection with an alternative noninfectious etiology. Here, we focused exclusively on categories 1 and 5. Lastly, healthy controls (n = 44; median age, 30 years; range, 23 to 59) were enrolled as part of a study on the effect of aspirin on platelet function among healthy volunteers without symptoms, where gene expression analyses were performed on pre–aspirin challenge time points (37). The totality of information used to support these adjudications would not have been available to clinicians at the time of their evaluation.

Procalcitonin measurement

Concentrations were measured at different stages during the study, and as a result, different platforms were used on the basis of availability. Some serum measurements were made on a Roche Elecsys 2010 analyzer (Roche Diagnostics) by electrochemiluminescence immunoassay. Additional serum measurements were made using the miniVIDAS immunoassay (bioMérieux). When serum was unavailable, measurements were made by the Phadia Immunology Reference Laboratory in plasma-EDTA by immunofluorescence using the B·R·A·H·M·S PCT sensitive KRYPTOR (Thermo Fisher Scientific). Replicates were performed for some paired serum and plasma samples, revealing equivalence in concentrations. Therefore, all procalcitonin measurements were treated equivalently, regardless of testing platform.

Microarray generation

At initial clinical presentation, patients were enrolled and samples were collected for analysis. After adjudications were performed as described above, 317 subjects with clear clinical phenotypes were selected for gene expression analysis. Total RNA was extracted from human blood using the Qiagen PAXgene Blood RNA Kit according to the manufacturer's protocol. RNA quantity and quality were assessed using the NanoDrop spectrophotometer (Thermo Scientific) and the Agilent 2100 Bioanalyzer, respectively. Hybridization and data collection were performed at Expression Analysis using the GeneChip Human Genome U133A 2.0 Array according to the Affymetrix Technical Manual. Microarrays were robust multiarray average (RMA)–normalized.

ARI classifier validation

The ARI classifier was validated using leave-one-out cross-validation in the same population from which it was derived. Independent, external validation occurred using publicly available human gene expression data sets from 328 individuals (GSE6269, GSE42026, GSE40396, GSE20346, and GSE42834). Data sets were chosen if they included at least two clinical groups (bacterial ARI, viral ARI, or noninfectious illness). We also used GSE60244 to specifically validate classifier performance in 25 subjects with bacterial/viral co-identification. To match probes across different microarray platforms, each ARI classifier probe was converted to gene symbols, which were used to identify corresponding target microarray probes. Batch differences across these independent data sets precluded the direct application of the ARI classifier. Consequently, the signatures in the ARI classifier were tuned to each data set to assess classification performance.
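The gene symbol–based probe matching described above can be sketched as a simple join between platform annotation tables. This is an illustrative reconstruction, not the authors' code; the probe IDs, gene symbols, and table layout below are hypothetical examples.

```python
import pandas as pd

# Hypothetical annotation tables: probe IDs -> gene symbols for the
# derivation platform and for an external validation platform.
source_annot = pd.DataFrame({
    "probe_id": ["201461_at", "202086_at", "204439_at"],
    "symbol":   ["MAPKAPK2", "MX1", "IFI44L"],
})
target_annot = pd.DataFrame({
    "probe_id": ["ILMN_1745374", "ILMN_1723912", "ILMN_1835092"],
    "symbol":   ["MX1", "IFI44L", "OAS1"],
})

def map_probes(classifier_probes, source_annot, target_annot):
    """Map classifier probes to target-platform probes via shared gene symbols."""
    symbols = source_annot.loc[
        source_annot["probe_id"].isin(classifier_probes), "symbol"
    ]
    return target_annot.loc[
        target_annot["symbol"].isin(symbols), "probe_id"
    ].tolist()

# Classifier probes lacking a symbol match on the target platform are dropped.
matched = map_probes(["202086_at", "204439_at"], source_annot, target_annot)
```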

Statistical analysis

The transcriptomes of 317 subjects (273 ill patients and 44 healthy volunteers) were measured in two microarray batches with seven overlapping samples (GSE63990). Exploratory principal components analysis and hierarchical clustering revealed substantial batch differences. These were corrected by first estimating and removing probe-wise mean batch effects using a Bayesian fixed effects model. Next, we fitted a robust linear regression model with Huber loss function using seven overlapping samples, which was used to adjust the remaining expression values.
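The second adjustment step, fitting a robust linear model on the overlapping samples and applying it to the remaining batch, can be sketched as follows. This is a minimal illustration with simulated values, assuming the probe-wise mean batch effects have already been removed; it uses scikit-learn's `HuberRegressor` (Huber loss) in place of the original analysis code.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)

# Simulated expression for 7 samples measured in both batches
# (one probe shown; in practice this is applied across the array).
batch1 = rng.normal(8.0, 1.0, size=7)                        # reference batch
batch2 = 1.1 * batch1 + 0.5 + rng.normal(0, 0.05, size=7)    # shifted/scaled replicate

# Robust linear fit (Huber loss) learned on the overlapping samples
# maps batch-2 values onto the batch-1 scale.
model = HuberRegressor().fit(batch2.reshape(-1, 1), batch1)

# Adjust remaining (non-overlapping) batch-2 samples with the learned mapping.
new_batch2_samples = np.array([[7.5], [9.2]])
adjusted = model.predict(new_batch2_samples)
```

The Huber loss behaves quadratically for small residuals and linearly for large ones, so a few aberrant overlap measurements do not dominate the fitted mapping.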

Sparse classification methods such as sparse logistic regression simultaneously perform classification and variable selection while reducing overfitting risk (38). Therefore, separate gene selection strategies such as univariate testing or sparse factor models are unnecessary. Here, a sparse logistic regression model was fitted independently to each of the binary tasks using the 40% of probes with the largest variance after batch correction (39). Specifically, we used a Least Absolute Shrinkage and Selection Operator (LASSO)–regularized generalized linear model with binomial likelihood, with nested cross-validation to select the regularization parameter. Scripts were written in MATLAB using the Glmnet toolbox. This generated bacterial ARI, viral ARI, and noninfectious illness classifiers. Because each binary classifier estimates class membership probabilities (for example, the probability of bacterial versus either viral or noninfectious illness in the case of the bacterial ARI classifier), we combined the three classifiers into a single decision model (termed the ARI classifier) following a one-versus-all scheme, whereby the largest membership probability assigns the class label (38, 40). Classification performance metrics included AUC for binary outcomes and confusion matrices for ternary outcomes (41). Determinations of significance included the Wilcoxon rank sum test, Fisher's exact test, and McNemar's test with Yates correction. Corrections for multiple testing and significance cutoffs are noted in Results.
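The one-versus-all scheme can be sketched as follows. This is an illustrative reconstruction on toy data, not the authors' MATLAB/Glmnet scripts: it uses scikit-learn's L1-penalized `LogisticRegressionCV` and, for brevity, plain rather than nested cross-validation to choose the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)

# Toy data: 90 subjects x 50 "probes", three classes
# (0 = bacterial ARI, 1 = viral ARI, 2 = noninfectious illness).
X = rng.normal(size=(90, 50))
y = np.repeat([0, 1, 2], 30)
X[y == 0, :5] += 2.0     # class-specific signal in a few probes
X[y == 1, 5:10] += 2.0
X[y == 2, 10:15] += 2.0

probs = np.zeros((90, 3))
for k in range(3):
    # LASSO (L1)-regularized logistic regression for one binary task
    # (class k versus the rest), with cross-validated regularization.
    clf = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=5, cv=5)
    clf.fit(X, (y == k).astype(int))
    probs[:, k] = clf.predict_proba(X)[:, 1]

# One-versus-all combination: the largest class-membership
# probability assigns the final label.
pred = probs.argmax(axis=1)
accuracy = (pred == y).mean()
```

The L1 penalty drives most coefficients to exactly zero, which is how a single fit yields both a classifier and a compact probe signature (for example, the 71-, 33-, and 26-probe sets reported above).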


Fig. S1. Positive and negative predictive values for (A) bacterial and (B) viral ARI classification as a function of prevalence.

Fig. S2. Validation of bacterial and viral ARI classifiers in GSE6269.

Fig. S3. Validation of bacterial and viral ARI classifiers in GSE42026.

Fig. S4. Validation of bacterial and viral ARI classifiers in GSE40396.

Fig. S5. Validation of bacterial and viral ARI classifiers in GSE20346.

Fig. S6. Validation of bacterial ARI and noninfectious illness classifiers in GSE42834.

Fig. S7. Treatment effect on bacterial ARI classification.

Fig. S8. Venn diagram representing overlap in the bacterial ARI, viral ARI, and noninfectious illness classifiers.

Table S1. Etiological causes of illness for subjects with viral ARI, bacterial ARI, and noninfectious illness.

Table S2. Summary of clinical features for the derivation cohort.

Table S3. Probes selected for the bacterial ARI, viral ARI, and noninfectious illness classifiers.

Table S4. Subjects with discordant predictions compared to clinical assignments.

Table S5. Genes in the bacterial ARI, viral ARI, and noninfectious illness classifiers grouped by biologic process.


Acknowledgments: We acknowledge bioMérieux Inc. for providing the reagents used to measure procalcitonin concentrations. Funding: Supported by the U.S. DARPA through contracts N66001-07-C-2024 and N66001-09-C-2082 and by grants from the NIH (U01AI066569, P20RR016480, and HHSN266200400064C). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. E.L.T. was supported by a National Research Service Award training grant provided by the Agency for Healthcare Research and Quality. E.L.T. and M.T.M. were also supported by Award numbers 1IK2CX000530 and 1IK2CX000611, respectively, from the Clinical Science Research and Development Service of the VA Office of Research and Development. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs. V.G.F. was supported by Mid-Career Mentoring Award K24-AI093969 from the NIH. The data contained in this article have not previously been presented. Author contributions: E.L.T., M.N., T.B., M.T.M., A.K.Z., L.C., G.S.G., and C.W.W. helped conceive the study. All authors helped acquire, analyze, or interpret data. E.L.T., R.H., and E.R.K. drafted the manuscript, which was critically revised by all remaining authors. Statistical analysis was specifically performed by R.H., J.L., and L.C. Funding was obtained by C.B.C., E.P.R., A.K.Z., S.F.K., V.G.F., G.S.G., and C.W.W. All authors had full access to all data in this study. Competing interests: The authors declare that they have no competing interests. The following individuals report additional activities, but not as competing interests to this manuscript: C.W.W. served as a scientific consultant to bioMérieux, Becton Dickinson, and Verigene. 
He received research support from the NIH, DARPA, the Defense Threat Reduction Agency (DTRA), the Bill and Melinda Gates Foundation (BMGF), the Veterans Administration (VHA), the Centers for Disease Control and Prevention, Novartis Pharmaceuticals, Roche Molecular, bioMérieux, and Qiagen. S.F.K. served as a scientific advisor to Parabase Genomics Inc. and Edico Genomics Inc. He received research support from the NIH. E.L.T. has received research support from DARPA, DTRA, BMGF, VHA, and Novartis Pharmaceuticals and has served as a scientific consultant to Immunexpress. V.G.F. has grants from the NIH, MedImmune, Forest/Cerexa, Pfizer, Merck, Advanced Liquid Logics, Theravance, Novartis, and Cubist. He served as the Chair of the Merck scientific advisory board for the V710 Staphylococcus aureus vaccine. He has been a consultant for Pfizer, Novartis, Galderma, Novadigm, Durata, Debiopharm, Genentech, Achaogen, Affinium, Medicines Co., Cerexa, Tetraphase, Trius, MedImmune, Bayer, Theravance, Cubist, Basilea, Affinergy, and Contrafect. He also received royalties from UpToDate and has been paid for the development of educational presentations for Green Cross, Cubist, Cerexa, Durata, and Theravance. G.S.G. has consulted for U.S. Diagnostic Standards and has served on the Scientific Advisory Board for Pappas Ventures. He has received grants from the U.S. Defense Advanced Research Projects Agency, the Gates Foundation, and Novartis Vaccines and Diagnostics. G.S.G., E.L.T., V.G.F., and C.W.W. have a patent pending for host gene expression signatures of S. aureus and Escherichia coli infections. G.S.G., E.L.T., R.H., T.B., M.T.M., L.C., and C.W.W. have filed a patent for methods of identifying infectious disease and assays for identifying infectious disease, as well as for molecular predictors of fungal infection. R.J.L. and S.F.K. have a patent pending for sepsis prognosis biomarkers. 
Data and materials availability: Gene expression data generated in this study have been deposited in the NCBI Gene Expression Omnibus (GSE63990). This study also used gene expression data from existing data sets (GSE6269, GSE42026, GSE40396, GSE20346, GSE42834, and GSE60244).
