Research Article | Antibiotics

A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection

Science Translational Medicine  04 Nov 2020:
Vol. 12, Issue 568, eaay5067
DOI: 10.1126/scitranslmed.aay5067

Triaging antibiotic use

Use of broad-spectrum, second-line antibiotics in treating urinary tract infections (UTIs) is increasing, likely due to the prevalence of antibiotic resistance. Kanjilal et al. applied a machine learning approach calibrated to local hospital electronic health record data to predict the probability of resistance to first- and second-line antibiotic therapies for uncomplicated UTI. The algorithm then recommended the least broad-spectrum antibiotic to which a given isolate was predicted nonresistant. Use of the pipeline reduced both broad-spectrum and ineffective antibiotic prescription for UTI in the patient cohort relative to clinicians, suggesting the clinical potential of the approach.


Antibiotic resistance is a major cause of treatment failure and leads to increased use of broad-spectrum agents, which begets further resistance. This vicious cycle is epitomized by uncomplicated urinary tract infection (UTI), which affects one in two women during their life and is associated with increasing antibiotic resistance and high rates of prescription for broad-spectrum second-line agents. To address this, we developed machine learning models to predict antibiotic susceptibility using electronic health record data and built a decision algorithm for recommending the narrowest possible antibiotic to which a specimen is susceptible. When applied to a test cohort of 3629 patients presenting between 2014 and 2016, the algorithm achieved a 67% reduction in the use of second-line antibiotics relative to clinicians. At the same time, it reduced inappropriate antibiotic therapy, defined as the choice of a treatment to which a specimen is resistant, by 18% relative to clinicians. For specimens where clinicians chose a second-line drug but the algorithm chose a first-line drug, 92% (1066 of 1157) of decisions ended up being susceptible to the first-line drug. When clinicians chose an inappropriate first-line drug, the algorithm chose an appropriate first-line drug 47% (183 of 392) of the time. Our machine learning decision algorithm provides antibiotic stewardship for a common infectious syndrome by maximizing reductions in broad-spectrum antibiotic use while maintaining optimal treatment outcomes. Further work is necessary to improve generalizability by training models in more diverse populations.


Uncomplicated urinary tract infection (UTI) refers to bacterial infection of a structurally normal lower urinary tract in a healthy female. It is an extremely common diagnosis that affects more than one in two women in their lifetime (1) and accounts for more than 13 million outpatient and emergency room visits (2). It is the third most common indication for antibiotic treatment in the United States (3), resulting in 4.7 million prescriptions annually (4). Fluoroquinolone antibiotics such as ciprofloxacin and levofloxacin are the most commonly prescribed antibiotic class for uncomplicated UTI, despite being second-line agents (4). These treatment choices may reflect the impact of increasing antibiotic resistance (5–7), which could lead clinicians to choose broad-spectrum therapies to minimize the risk of treatment failure.

Reducing the unnecessary use of fluoroquinolones has been a target for antimicrobial stewardship programs due to the well-documented risks of serious adverse events that include secondary infection with Clostridioides difficile (8), selection of multidrug-resistant organisms (9), tendinopathies (10), and aortic dissection (11). National practice guidelines published by the Infectious Diseases Society of America (IDSA) (12) provide a treatment algorithm that avoids the use of fluoroquinolones for uncomplicated UTI, but adherence is low (13–15), partly because they are designed to be broadly applicable across many populations, leaving the task of personalizing treatment decisions to the clinician (16, 17). Thus, a data-driven clinical decision support tool to identify candidates for the first-line therapies nitrofurantoin and trimethoprim-sulfamethoxazole (TMP-SMX) could greatly reduce harm to patients by avoiding exposure to fluoroquinolones while still maintaining optimal treatment outcomes.

Computer algorithms have been used for clinical decision support in the management of infectious diseases since the 1970s (18). Recently, machine learning has been used to predict antibiotic resistance in bloodstream infections (19–21), in UTI (22), and from pathogen genomic data (23). These approaches can provide new insights into clinical phenomena (24, 25) but have not yet been widely adopted because of the difficulty of integrating them into clinical workflows, their lack of interpretability, and an absence of evidence proving their generalizability and their utility in actual clinical settings.

Here, we use the syndrome of uncomplicated UTI to propose a solution to the challenge of antibiotic prescription in the era of resistance. We applied machine learning to data in the electronic health record (EHR) to predict the probability of antibiotic resistance to first- and second-line therapies. We then developed a decision algorithm that translates probabilities into recommendations designed to select the antibiotic of the narrowest possible spectrum while still achieving clinical cure and benchmarked its performance relative to clinicians and a best-case adaptation of the national practice guidelines. We structured our algorithm with the intention of its deployment as an interpretable and personalized decision support tool embedded in the EHR to provide robust antimicrobial stewardship for a common outpatient diagnosis. Future efforts will focus on increasing the diversity of our sample to ensure robust recommendations across diverse races/ethnicities and socioeconomic strata.


Design and evaluation of a machine learning decision algorithm

We conducted this study in three parts. The first part consisted of building machine learning models to predict the probability of nonsusceptibility to antibiotics used to treat uncomplicated UTI. Models were trained on data from n = 10,053 patients (n = 11,865 specimens) with uncomplicated UTI presenting between 1 January 2007 and 31 December 2013 to Massachusetts General Hospital and Brigham and Women’s Hospital. We then developed a decision algorithm that translated probabilities into susceptibility phenotypes and chose the treatment of the narrowest spectrum among those that were susceptible. Last, we retrospectively evaluated the performance of our algorithm versus clinicians on a test set consisting of n = 3629 patients (n = 3941 specimens) with uncomplicated UTI who presented to the same hospitals between 1 January 2014 and 31 December 2016. We also compared performance against an adaptation of the national practice guidelines designed to allow clinicians the use of second-line antibiotics in a given percentage of decisions. Figure 1 outlines the analytic protocol.

Fig. 1 Schematic of analytic protocol.

(A) We trained decision tree, logistic regression, and random forest models to predict nonsusceptibility to nitrofurantoin (NIT), TMP-SMX, ciprofloxacin (CIP), and levofloxacin (LVX). We selected the logistic regression models for use on our test cohort, which consisted of patients presenting between 2014 and 2016. (B) We set a false-negative rate and identified the corresponding probability threshold for a given antibiotic. Isolates with predicted probabilities of nonsusceptibility (NS) greater than this threshold were categorized as NS, whereas those with probabilities below this threshold were categorized as “susceptible” (S). This was repeated for all four antibiotics to yield a set of probability thresholds that could be used to bin predicted probabilities into phenotypes for each specimen. We then chose the antibiotic of the narrowest spectrum among those considered susceptible and calculated our two primary outcomes. This process was repeated for 1331 sets of thresholds. The optimal threshold set was selected to meet a prespecified target of minimizing inappropriate antibiotic therapy to the greatest extent possible while not exceeding a second-line antibiotic usage rate of 10%. (C) We evaluated our algorithm by retraining our chosen prediction models from part A on the entire training cohort and then performing prediction on the test cohort. Treatment decisions were made using the optimal threshold set from part B, and the resulting primary outcomes were compared to the performance of clinicians and a best-case guideline-based policy.

Few patients with uncomplicated UTI have known risk factors for antibiotic resistance

Baseline characteristics for the training, test, and combined cohorts are shown in Table 1. Mean age was 34 years (SD, 10.9 years), and 64.2% of patients self-identified as white. Patients in the test set presented more frequently in the emergency room and had a higher prevalence of resistance to ciprofloxacin and levofloxacin. Test set patients were more likely to receive treatment with nitrofurantoin, reflecting a shift in prescribing patterns after the dissemination of the updated IDSA guidelines in 2010. Time trends for major features are located in fig. S1. The prevalence of resistance in our cohort to fluoroquinolones was lower than national estimates taken from a cross-sectional survey performed in 2012 (5.8% versus 10.6%) (5). Conversely, the prevalence of resistance to nitrofurantoin in our sample was higher (12.1% versus 3.3%). Among patients with antibiotic resistance, the majority had no observed risk factors for drug resistance (75.5%, n = 2747 specimens in the training set; 80.5%, n = 1000 specimens in the test set), defined as a prior resistant organism or antibiotic exposure in the previous 90 days.

Table 1 Demographics, location of specimen collection, and microbiologic and treatment characteristics for patients with uncomplicated UTI.

Patients in the training set presented to Massachusetts General Hospital and Brigham and Women’s Hospital between 2007 and 2013 and those in the test set presented between 2014 and 2016. P values are for differences between the training and test sets and were calculated using two-sample t tests for normally distributed variables and a nonparametric randomization test derived from random permutations of the dataset labels for nonnormally distributed variables.
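The randomization test mentioned in the table legend can be illustrated with a minimal sketch. The legend does not state which test statistic was used, so the absolute difference in group means below is an assumption, as are the function name `permutation_test`, the number of permutations, and the add-one correction.

```python
import random
from statistics import mean


def permutation_test(train_values, test_values, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two groups.

    Assumption: the statistic is the absolute difference in means; the
    source only says that dataset labels were randomly permuted.
    """
    rng = random.Random(seed)
    observed = abs(mean(train_values) - mean(test_values))
    pooled = list(train_values) + list(test_values)
    n_train = len(train_values)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to permuting the
        # training/test labels while keeping group sizes fixed.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_train]) - mean(pooled[n_train:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated P value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

The returned value estimates how often a label shuffle produces a difference at least as large as the observed one.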

Accuracy of prediction models for antibiotic susceptibility is influenced by prior antibiotic exposure and prior antibiotic resistance

For each patient in our cohort, we constructed a feature vector containing demographics, microbiology, antibiotic exposures, comorbidities, procedures, and basic laboratory values. We additionally constructed two population-level features: colonization pressure, which we defined as the population-level prevalence of resistance in urine specimens to an antibiotic in the 90 days preceding specimen submission, and hospital-wide antibiotic consumption. Major features except for colonization pressure were summarized 7, 14, 30, 90, and 180 days before the date of collection for a urine specimen. Additional features were added to indicate the presence of antibiotic nonsusceptibility or antibiotic exposure at any previous point in time. We excluded any data that would not be present at the time of an empiric treatment decision. We then trained logistic regression, decision tree, and random forest models to predict the probability of nonsusceptibility, defined as the likelihood that an isolate would be called “intermediate” or “resistant” by the clinical microbiology laboratory, to first- and second-line treatments. The models were trained on patients presenting between 2007 and 2013 and tested on patients presenting between 2014 and 2016.

Logistic regression was selected for prediction of nonsusceptibility to all four antibiotics, based on validation performance and interpretability. In the held-out cohort, the area under the receiver operating characteristic curves (AUROCs) for nitrofurantoin and TMP-SMX were 0.56 [95% confidence interval (CI), 0.53 to 0.59] and 0.59 (95% CI, 0.57 to 0.62), respectively. For ciprofloxacin and levofloxacin, the AUROCs were identical at 0.64 (95% CI, 0.60 to 0.68). Limiting prediction to the subset of patients with prior antibiotic resistance or antibiotic exposure in the past 6 months improved all AUROCs but had the greatest impact for the fluoroquinolones (Table 2). Model hyperparameters, ROC curves, and calibration plots are located in table S1, fig. S2, and fig. S3, respectively.

Table 2 AUROCs for prediction of antibiotic nonsusceptibility in patients presenting with uncomplicated UTI between 2014 and 2016.

A decision algorithm reduces use of second-line therapies relative to clinicians and matches a best-case implementation of the treatment guidelines

We next translated probabilities of nonsusceptibility into phenotypes that fed into a decision algorithm designed to select the narrowest possible effective antibiotic. We achieved this by setting a threshold above which probabilities were phenotypically classified as “nonsusceptible” and below which they were phenotypically classified as “susceptible.” For each specimen, a set of four thresholds, one per treatment choice, was applied, and the algorithm then recommended the antibiotic of narrowest spectrum among those predicted to be susceptible as the optimal treatment, making no decision if the specimen was predicted to be nonsusceptible to all treatments. We then calculated our primary outcomes, which were the proportion of recommendations for second-line antibiotics and the proportion of recommendations that resulted in inappropriate antibiotic therapy, defined as the use of an antibiotic to which the organism has in vitro resistance. We repeated this analysis sequence for 1331 threshold sets in total and chose a final set that met a prespecified target of minimizing inappropriate antibiotic therapy while allowing second-line usage in 10% of decisions in the validation set, which represented a realistic lower bound for clinicians in real-world settings (Fig. 2).
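The thresholding step can be written as a short sketch. The names `SPECTRUM_ORDER` and `recommend` are hypothetical, and the ordering of drugs within each line of therapy (nitrofurantoin before TMP-SMX, ciprofloxacin before levofloxacin) is an assumption. The threshold values in the usage lines are those the article reports for its final threshold set.

```python
from typing import Dict, Optional

# Narrowest to broadest spectrum. First-line agents precede second-line
# agents as in the article; the order within each line is an assumption.
SPECTRUM_ORDER = ["NIT", "TMP-SMX", "CIP", "LVX"]


def recommend(prob_ns: Dict[str, float], thresholds: Dict[str, float]) -> Optional[str]:
    """Return the narrowest-spectrum drug predicted susceptible, or None.

    A drug is predicted nonsusceptible when its predicted probability of
    nonsusceptibility is strictly greater than that drug's threshold.
    """
    for drug in SPECTRUM_ORDER:
        if prob_ns[drug] <= thresholds[drug]:
            return drug
    return None  # predicted nonsusceptible to all four: no recommendation


# Thresholds reported in the article for the final threshold set.
thresholds = {"NIT": 0.13, "TMP-SMX": 0.18, "CIP": 0.26, "LVX": 0.24}
print(recommend({"NIT": 0.20, "TMP-SMX": 0.10, "CIP": 0.05, "LVX": 0.05}, thresholds))
```

In this illustrative call, nitrofurantoin exceeds its threshold, so the rule falls through to TMP-SMX, the next narrowest option predicted susceptible.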

Fig. 2 Threshold sensitivity analysis.

Primary outcomes for 1331 unique threshold sets. The final threshold set (indicated by the blue dot) had the lowest rate of inappropriate antibiotic therapy among the 1331 threshold sets that had less than 10% usage of ciprofloxacin and levofloxacin on the validation set. IAT, inappropriate antibiotic therapy.
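The selection of a final threshold set can be sketched as a constrained search. This is a minimal illustration, not the authors' code: the data layout (`prob_ns`, `nonsusceptible`), the function names, and the choice to use all specimens as the denominator for both outcomes are assumptions. The text describes the cap both as "not exceeding" and "less than" 10%; the sketch uses `<=`.

```python
ORDER = ["NIT", "TMP-SMX", "CIP", "LVX"]  # narrowest to broadest (assumed order)
SECOND_LINE = {"CIP", "LVX"}


def recommend(prob_ns, thresholds):
    """Narrowest drug whose predicted probability is at or below its threshold."""
    return next((d for d in ORDER if prob_ns[d] <= thresholds[d]), None)


def outcomes(thresholds, specimens):
    """Return (second-line usage rate, inappropriate therapy rate)."""
    second_line = inappropriate = 0
    for s in specimens:
        drug = recommend(s["prob_ns"], thresholds)
        if drug is None:
            continue  # no recommendation made for this specimen
        second_line += drug in SECOND_LINE
        # Inappropriate: the recommended drug is one the isolate resists.
        inappropriate += drug in s["nonsusceptible"]
    n = len(specimens)
    return second_line / n, inappropriate / n


def select_threshold_set(candidates, specimens, max_second_line=0.10):
    """Pick the candidate with the lowest inappropriate-therapy rate
    among those whose second-line usage stays within the cap."""
    best = None
    for thresholds in candidates:
        sl_rate, iat_rate = outcomes(thresholds, specimens)
        if sl_rate <= max_second_line and (best is None or iat_rate < best[1]):
            best = (thresholds, iat_rate, sl_rate)
    return best
```

Here `candidates` would hold the threshold sets under consideration, each a dict mapping a drug to its probability cutoff, and `specimens` would be the validation data.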

Using this process, we determined that the optimal thresholds for achieving our prespecified target were predicted probabilities of nonsusceptibility of >13% for nitrofurantoin, >18% for TMP-SMX, >26% for ciprofloxacin, and >24% for levofloxacin (Fig. 3). A decision algorithm applied to the test data using these thresholds was able to make a recommendation in 99% of specimens and chose ciprofloxacin or levofloxacin for 11.0% (95% CI, 10.0 to 12.0%) of specimens. This was a 67% reduction in the selection of these antibiotics relative to clinicians (33.6%; 95% CI, 32.1 to 35.0%). Algorithm recommendations resulted in inappropriate antibiotic therapy in 9.8% (95% CI, 8.9 to 10.8%) of test specimens, an 18% reduction compared to clinicians (11.9%; 95% CI, 10.9 to 12.9%). The algorithm had similar rates of inappropriate antibiotic therapy to a best-case implementation of the national treatment guidelines (10.7%; 95% CI, 9.7 to 11.7%), where second-line usage was capped at 10% (Table 3).
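The relative reductions quoted above follow directly from the reported rates. The helper name `relative_reduction` is ours; the inputs are the percentages given in the text.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Second-line use: clinicians 33.6% versus algorithm 11.0%.
print(round(100 * relative_reduction(33.6, 11.0)))  # 67

# Inappropriate antibiotic therapy: clinicians 11.9% versus algorithm 9.8%.
print(round(100 * relative_reduction(11.9, 9.8)))  # 18
```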

Fig. 3 False susceptibility and nonsusceptibility rates.

Rates of false susceptibility and false nonsusceptibility for prediction models derived from the training data. Final thresholds used on the test data are indicated by black vertical lines. Isolates with probabilities of nonsusceptibility below the threshold were categorized as susceptible, and those above were categorized as nonsusceptible.

Table 3 Comparison of primary outcomes for algorithm, clinicians, and best-case guideline-based policy in patients presenting with uncomplicated UTI between 2014 and 2016.

The decision algorithm better differentiates nonsusceptibility to first-line therapies relative to clinicians

We sought to better understand the factors driving algorithmic and clinical decisions through a post hoc analysis of the test results. The algorithm was able to discern nonsusceptibility to first-line agents better than clinicians. This difference was driven by the proportion of inappropriate antibiotic therapy with nitrofurantoin and TMP-SMX by clinicians (11.1% and 19.1%, respectively) relative to the algorithm (9.6% and 14.3%, respectively; table S2). For cases where clinicians chose a second-line therapy but the algorithm chose a first-line agent, the specimen proved susceptible to the first-line agent in 92% (1066 of 1157) of decisions. For cases where clinicians chose an inappropriate first-line therapy, the algorithm correctly chose the appropriate first-line agent 47% (183 of 392) of the time (Fig. 4). We performed a manual review of 18 randomly selected charts where the algorithm (but not the clinician) chose the proper first-line agent and found that 10 patients (56%) had no prior antibiotic resistance or exposure to first-line therapies, 1 patient (6%) had complicated UTI or pyelonephritis, and 2 patients (11%) had no clinical documentation. Using regularized logistic regression, we observed that the top three features predicting use of fluoroquinolones by clinicians were prior fluoroquinolone use, white race, and being seen in the emergency room, suggesting that provider preferences rather than patient risk factors for resistance may be driving use. Being seen in an outpatient clinic and prior resistance to ciprofloxacin were negatively associated with fluoroquinolone prescription (table S3).

Fig. 4 Post hoc analysis of clinician versus algorithm therapy decisions and appropriateness in patients with uncomplicated UTI presenting between 2014 and 2016.

Appropriate therapy was defined as the choice of an empiric antibiotic that has in vitro activity against the pathogen, whereas inappropriate therapy was defined as the choice of an empiric antibiotic that has no in vitro activity against the pathogen.

Algorithm recommendations would be actionable in actual clinical practice

Because our model is not able to account for all factors used in treatment decisions, we anticipated that a percentage of recommendations would be ignored by clinicians because of contraindications. We sought to estimate the percentage and the reasons for contraindication through an additional manual review of 20 randomly selected charts. For the scenario where clinicians chose a second-line agent when the algorithm correctly recommended a first-line agent, 15 of 20 (75%) of recommendations were actionable and 3 of 20 (15%) were contraindicated because of suspicion of pyelonephritis or the presence of multiple infectious syndromes. The actionability of the algorithm could not be determined in 2 of 20 (10%) because of a lack of clinical documentation. On the basis of this chart review, we performed a conservative sensitivity analysis that assumed that only 75% of algorithm recommendations would be actionable. Second-line use in the test set was 47% lower for the algorithm than for clinicians (17.8% versus 33.6%, respectively), whereas the proportion of inappropriate antibiotic therapy was nearly equal at 11.3% versus 11.8% for the algorithm and clinicians, respectively.

Antibiotic susceptibility to first-line therapies is influenced by prior resistance and antibiotic exposures

Last, we characterized the types of features predictive of antibiotic nonsusceptibility in patients with uncomplicated UTI. We first grouped features into sets corresponding to risk factor domains known to be associated with resistance. We then inferred the importance of each feature group by estimating the decline in predictive performance when that set was left out of a regularized logistic regression model. Prior antibiotic resistance was the most important feature set predicting nonsusceptibility to nitrofurantoin and fluoroquinolones, whereas both prior antibiotic exposures and resistance were most important for predicting nonsusceptibility to TMP-SMX. None of the changes in AUROCs reached statistical significance (Fig. 5). A description of the 10 most important individual features for predicting nonsusceptibility to each antibiotic is located in table S4.

Fig. 5 Feature importance characterization.

Data points represent the AUROC of logistic regression models that left out the feature set indicated on the x axis. The red dotted lines represent the AUROC for the full model, which contains all feature sets. The blue dotted lines represent the 95% CI for the full model. Error bars represent 95% CIs for the individual models.


In this study of patients presenting with uncomplicated UTI, we show how data-driven prescription strategies can help resolve the tension between maintaining optimal patient outcomes and reducing broad-spectrum antibiotic use. Using only information passively collected in the EHR, our decision algorithm was able to reduce prescription of second-line agents by 67% while at the same time also reducing inappropriate antibiotic treatment by 18% relative to clinicians. The implementation of this algorithm as a point-of-care clinical decision support tool could be a valuable addition to outpatient antimicrobial stewardship programs.

Machine learning applied to observational health data was used to predict antibiotic resistance in a large cohort of Israeli patients with UTI (22). Unlike the present study, which sought to balance the two competing objectives of reducing inappropriate antibiotic therapy and reducing broad-spectrum antibiotic use, their algorithm had the single objective of reducing inappropriate therapy with antibiotics. Their study also included males, pregnant females, and the elderly, further precluding a direct comparison. They achieved a retrospective reduction in inappropriate antibiotic therapy by 30 to 40% relative to clinicians by always selecting the antibiotic with the highest probability of susceptibility. However, because broad-spectrum drugs have the lowest rates of resistance, the final model had a high rate of selection for fluoroquinolones. In contrast, our work focuses on using machine learning to fill an unmet need for antimicrobial stewardship. Therefore, our goal was to penalize the use of fluoroquinolones and evaluate model utility under conditions that mimic a real-world clinical scenario to the greatest extent possible. This motivated our use of strict inclusion criteria for uncomplicated UTI as it was essential to performing a fair evaluation that accounts for factors driving clinical decisions.

Across the United States, fluoroquinolones are prescribed in 42% of treatment decisions in uncomplicated UTI (4). This is a much higher proportion than what would be expected based on criteria set forth in the IDSA treatment guidelines. Although the guidelines are intended to be a tool to reduce unnecessary use of fluoroquinolones, adherence has been poor. Major reasons are their lack of personalization to the patient (13, 15) and the substantial variability among physicians in tolerance for treatment failure, regardless of prior risk of resistance (26). We noted that one in three patients in the test cohort was prescribed a fluoroquinolone and the primary drivers for this choice were presentation in the emergency room, white race, and prior treatment with fluoroquinolones. One possible explanation for this is a lower tolerance for the risk of inappropriate antibiotic therapy among clinicians practicing in certain clinics or encountering specific patient populations. Our algorithm achieved antimicrobial stewardship targets that would be expected under a best-case scenario where guideline adherence leads to a second-line usage of only 10%. Unlike the guidelines, it is able to do this in a manner that provides interpretable recommendations derived from models trained on data from the local population. Further research is necessary to determine whether these aspects would be sufficient to influence prescribing practices in settings where guidelines have been ineffective and tolerance for treatment failure is low.

From model conception through execution and evaluation, our intent was to promote generalizability by mimicking the clinical context in which we envisioned that the algorithm would be deployed. This affected our analysis in three ways. The first was our choice to eschew more sophisticated modeling approaches in favor of model classes that have greater interpretability and computational tractability. Second, we elected to use a time-based train/test split despite the secular trends in our covariates because this would best recapitulate the implementation of our algorithm in clinical practice. Last, our thresholding-based method to translate continuous probabilities into categorical decisions represents a value judgment to minimize the use of second-line antibiotics at the cost of inappropriate therapy in a subset of patients. We believe that this approach can be easily translated to identify empiric treatments for other infectious syndromes such as hospital-acquired pneumonia and bloodstream infection by simply adjusting the trade-off between the two outcomes.

There remain several outstanding questions regarding generalizability. First, because our cohort consisted of mostly Caucasians, it is possible that predictions will be biased when applied to more diverse populations. We have tried to minimize this by using nationally adopted criteria for uncomplicated UTI. Second, given that antibiotic resistance is an important predictor, we expect that the model would predict nonsusceptibility more often in environments where the prevalence of antibiotic resistance is higher than what is seen in our training data. This is most pertinent for nitrofurantoin and TMP-SMX because our decision algorithm is heavily weighted to favor first-line agents. Increases in the risk of resistance to the fluoroquinolones may also indirectly and negatively affect model performance because it is likely that settings with a high prevalence of resistance to fluoroquinolones also have a high prevalence of resistance to first-line agents. The impact of antibiotic exposure and prior resistance on current antibiotic resistance is well established, but quantifying the impact of each has been challenging (27). In this study of healthy patients with uncomplicated UTI, confounding by indication is unlikely to be an issue and we have assessed temporality by using longitudinal data.

In 25% of decisions, algorithm recommendations were nonactionable because of contraindications that would be known to clinicians but not to the model. Even in a worst-case scenario where we ignore 25% of recommendations, the algorithm still maintained a rate of inappropriate antibiotic therapy that was no worse than clinicians and maintained a 47% reduction in second-line agent use. We anticipate that in clinical practice, the majority of nonactionable recommendations will be due to triggering of the decision support tool in patients with pyelonephritis and a minority due to allergies or antibiotic intolerance. Only 3% of patients are estimated to have an allergy or intolerance to TMP-SMX (28, 29), although that may be an underestimation given that documentation of allergies in medical charts is poor. We suggest that implementation of the algorithm should be accompanied by a means for clinicians to provide feedback on contraindications when rejecting recommendations, thereby providing a mechanism for reducing inappropriate deployment. Although it would not affect predictive performance, future work could also incorporate urinalysis results to restrict deployment to patients with pyuria.

In summary, we have developed a decision algorithm for reducing unnecessary broad-spectrum empiric treatment for patients with uncomplicated UTI. Further work is necessary to develop the algorithm into a clinical decision support tool that integrates seamlessly into clinical workflows and draws from a continually retrained machine learning model to provide interpretable recommendations with measures of uncertainty in real time. A randomized controlled trial designed to prove the efficacy of such a tool for antimicrobial stewardship is critical for adoption into routine practice.


Study design

This study was designed as a retrospective analysis of n = 13,682 patients (n = 15,806 specimens) with uncomplicated UTI who presented between 2007 and 2016 to the Massachusetts General Hospital and the Brigham and Women’s Hospital in Boston, MA. The cohort was split into a training dataset with n = 10,053 patients (n = 11,865 specimens) presenting between 1 January 2007 and 31 December 2013 and a test set consisting of n = 3629 patients (n = 3941 specimens) who presented between 1 January 2014 and 31 December 2016. Uncomplicated UTI was defined as infection in the lower urinary tract of a nonpregnant female between the ages of 18 and 55 years with no abnormalities of the genitourinary tract and no major comorbidities. All patients provided urine cultures with an organism burden sufficient to warrant antibiotic susceptibility testing (>50,000 colony-forming units/ml with at most two organisms present) and received empiric antibiotic treatment with one of the first-line agents, nitrofurantoin or TMP-SMX, or one of the second-line agents, ciprofloxacin or levofloxacin. The empiric treatment window was defined as 48 hours before and up to 24 hours after specimen collection. We excluded patients with pyelonephritis and did not predict for fosfomycin because only 3.4% of specimens underwent susceptibility testing. We excluded specimens that did not undergo susceptibility testing for all four target antibiotics and any specimens that were prescribed multiple antibiotics during the empiric treatment window because we would not be able to make a clean comparison between our algorithm’s recommendation and clinicians. We also excluded from the test set any specimens from patients who also submitted specimens in the training set (4.4% of all specimens) to prevent data leakage. The selection of patients is shown in fig. S4. This study was approved by the Institutional Review Board of Massachusetts General Hospital with a waived requirement for informed consent.

Description of model features

Our data were derived from the Boston Infectious Diseases Cohort, a database of 271,827 patients who submitted a specimen to the microbiology laboratories of Massachusetts General Hospital and Brigham and Women’s Hospital between 2000 and 2016. For our prediction models, we extracted patient-level microbiology, demographics, antibiotic exposures, comorbidities (30), procedures, and basic laboratory values. Microbiologic data included the hospital location of specimen collection and antibiotic susceptibility profiles. Breakpoints were applied to the raw susceptibility data as defined in the 27th edition of the M100 document published by the Clinical and Laboratory Standards Institute (31) to provide uniform interpretations over the course of the study. Specimens coming from the same body site and growing the same organism within a 14-day period were dropped. Antibiotic exposures, prior resistance, prior organism, laboratories, comorbidities, and prior hospitalizations were summarized over the 7, 14, 30, 90, and 180 days before the date of a microbiologic specimen. Antibiotic exposures and prior resistance were also summarized across all available patient history to capture the 35% of patients in our training set with a history of resistance or medication outside of the preceding 180 days. We did not have access to dose or duration of antibiotic therapy, urinalysis results, drug allergies, or data for patient encounters occurring outside of our two centers.
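The windowed summaries described above can be sketched as follows. The text does not say whether each window was encoded as a count or a binary indicator, so the binary encoding here is an assumption, as are the function name `lookback_features` and the feature naming scheme. Only events strictly before the specimen date are used, in keeping with the exclusion of data unavailable at the time of an empiric decision.

```python
from datetime import date, timedelta

WINDOWS_DAYS = (7, 14, 30, 90, 180)  # lookback windows named in the text


def lookback_features(event_dates, specimen_date, prefix):
    """Binary indicators for whether any event fell within each window
    before the specimen date, plus an 'ever' indicator for all history."""
    features = {}
    for days in WINDOWS_DAYS:
        start = specimen_date - timedelta(days=days)
        features[f"{prefix}_{days}d"] = int(
            any(start <= d < specimen_date for d in event_dates)
        )
    # Captures history outside the 180-day window.
    features[f"{prefix}_ever"] = int(any(d < specimen_date for d in event_dates))
    return features


# Hypothetical example: a ciprofloxacin exposure 12 days before collection.
print(lookback_features([date(2014, 5, 20)], date(2014, 6, 1), "cip_exposure"))
```

In the example, the 7-day indicator is 0 and the 14-, 30-, 90-, and 180-day and "ever" indicators are 1.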

We incorporated two population-level features into our models. The first was an adaptation of colonization pressure (32), which we defined as the prevalence of resistance among urinary specimens to an antibiotic over a predefined service area in the previous 90 days. We calculated this metric for three hierarchies of service areas: (i) the ward (separate prevalence for each outpatient clinic or inpatient ward), (ii) the facility (separate prevalence for five categories: hospital, general inpatient ward, intensive care unit, emergency room, or outpatient), and (iii) overall (a single prevalence across both hospitals). The second incorporated feature was cumulative antibiotic usage across both hospitals normalized by total patient volume per quarter. A detailed description of features and the analytic protocol is located in table S5 and fig. S5.
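The colonization pressure feature can be sketched as a windowed prevalence. The function name, the record layout, and the choice to return `None` when a service area has no specimens in the window are assumptions made for illustration only.

```python
from datetime import date, timedelta

def colonization_pressure(history, area, as_of, window_days=90):
    """Fraction of urinary specimens from `area` collected in the
    `window_days` before `as_of` that were resistant to a given antibiotic.

    history: iterable of (area, collection_date, is_resistant) tuples.
    Returns None when the area has no specimens in the window (assumption).
    """
    start = as_of - timedelta(days=window_days)
    outcomes = [res for a, d, res in history if a == area and start <= d < as_of]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)

# Hypothetical ward-level history for one antibiotic
history = [
    ("clinic_A", date(2012, 3, 1), True),
    ("clinic_A", date(2012, 4, 1), False),
    ("clinic_A", date(2012, 4, 15), False),
    ("clinic_A", date(2011, 1, 1), True),   # outside the 90-day window
    ("clinic_B", date(2012, 4, 1), True),   # different service area
]
cp_ward = colonization_pressure(history, "clinic_A", date(2012, 5, 1))
cp_empty = colonization_pressure(history, "clinic_C", date(2012, 5, 1))
```

The same function could be applied at the facility or overall level by changing what the `area` label represents.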

Machine learning architecture

We trained logistic regression (LR), decision tree (DT), and random forest (RF) models to predict the probability that a specimen would be called nonsusceptible to nitrofurantoin, TMP-SMX, ciprofloxacin, or levofloxacin. The term nonsusceptible included both intermediate and resistant phenotypes. Models were trained on data from patients who submitted urine specimens between 1 January 2007 and 31 December 2013 (the training set). Hyperparameters for each combination of model class (LR, DT, and RF) and antibiotic were tuned by training on 70% of the training data and evaluating the AUROC on the remaining 30% of the training data, referred to as the “validation set.” For each model class, we chose the hyperparameter set that produced the highest mean AUROC, which was generated by averaging across five 70/30 splits of the training data. Using these hyperparameters, we then trained each of the three model classes on 20 new 70/30 splits of the training data and evaluated the mean AUROC on the validation set. LR performed best for nitrofurantoin and TMP-SMX. Although RF performed marginally better than LR for ciprofloxacin and levofloxacin, LR was chosen for all subsequent analyses for reasons of interpretability.
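The repeated 70/30 evaluation can be sketched as follows. This is a pure-Python illustration and not the study's implementation: the text does not name the software used for model fitting, and `fit_toy` is a hypothetical stand-in for an LR, DT, or RF learner. The AUROC is computed through its pairwise-ranking definition.

```python
import random
from statistics import mean

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_validation_auroc(X, y, fit, n_splits, seed=0):
    """Average validation AUROC over `n_splits` random 70/30 splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    cut = int(0.7 * len(idx))
    aucs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        tr, va = idx[:cut], idx[cut:]
        predict = fit([X[i] for i in tr], [y[i] for i in tr])
        aucs.append(auroc([y[i] for i in va], [predict(X[i]) for i in va]))
    return mean(aucs)

def fit_toy(X_train, y_train):
    """Hypothetical one-feature 'model': orient the score by class means."""
    pos_mean = mean(x[0] for x, t in zip(X_train, y_train) if t == 1)
    neg_mean = mean(x[0] for x, t in zip(X_train, y_train) if t == 0)
    direction = 1.0 if pos_mean >= neg_mean else -1.0
    return lambda x: direction * x[0]

# Toy, perfectly separable data: label is 1 when the feature is >= 50
X = [[i] for i in range(100)]
y = [1 if i >= 50 else 0 for i in range(100)]
mean_auc = mean_validation_auroc(X, y, fit_toy, n_splits=5)
```

Under this scheme, hyperparameter tuning would call `mean_validation_auroc` once per candidate setting with five splits and keep the setting with the highest mean, and model selection would repeat the call with 20 new splits.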

Decision algorithm

We next translated probabilities output by our predictive models into susceptibility phenotypes by performing a sensitivity analysis with various false-negative rates (FNRs) for each antibiotic. In this context, a false negative corresponds to falsely predicting susceptibility, also referred to as a “very major error.” We used only logistic regression and excluded decision trees and random forests based on their poor validation set performance, as well as their relative lack of interpretability.

Using the 20 70% splits of the training data noted above, we set an FNR value and identified the probability of nonsusceptibility that would result in that value. We then used this probability as a threshold and applied it to the corresponding validation dataset to bin probabilities into susceptible or nonsusceptible phenotypes. A “threshold set” was generated by repeating this process for each antibiotic. For each specimen, we next selected the antibiotic of narrowest spectrum (nitrofurantoin < TMP-SMX < ciprofloxacin < levofloxacin) among those that were considered susceptible as the final treatment recommendation. If no antibiotic was considered susceptible, the decision algorithm made no choice. Using this set of recommendations, we calculated our primary outcomes, the inappropriate antibiotic therapy and second-line antibiotic usage rates, in that particular validation dataset. For any specimens where the algorithm was unable to make a treatment recommendation, we defaulted to the decision made by the clinician at evaluation time.
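The two steps above, deriving a probability threshold from a target FNR and then picking the narrowest-spectrum antibiotic predicted susceptible, can be sketched as below. The function names and the quantile-style threshold rule are assumptions; the text states only that the threshold was the probability producing the chosen FNR.

```python
# Spectrum ordering from the text, narrowest first
SPECTRUM_ORDER = ["nitrofurantoin", "TMP-SMX", "ciprofloxacin", "levofloxacin"]

def threshold_for_fnr(probs, labels, target_fnr):
    """Pick a threshold t so that at most `target_fnr` of truly
    nonsusceptible specimens (label 1) have predicted probability < t,
    i.e. would be falsely called susceptible."""
    ns = sorted(p for p, y in zip(probs, labels) if y == 1)
    k = int(target_fnr * len(ns))  # nonsusceptible specimens allowed below t
    return ns[k] if k < len(ns) else float("inf")

def recommend(probs, thresholds, clinician_choice):
    """Return the narrowest-spectrum drug predicted susceptible; fall back
    to the clinician's choice when no drug is predicted susceptible."""
    for drug in SPECTRUM_ORDER:
        if probs[drug] < thresholds[drug]:
            return drug
    return clinician_choice

# Hypothetical predicted probabilities of nonsusceptibility and true labels
val_probs = [0.9, 0.7, 0.4, 0.2, 0.3, 0.1]
val_labels = [1, 1, 1, 1, 0, 0]
t = threshold_for_fnr(val_probs, val_labels, target_fnr=0.5)

thresholds = {drug: 0.3 for drug in SPECTRUM_ORDER}  # illustrative values
choice = recommend(
    {"nitrofurantoin": 0.5, "TMP-SMX": 0.1, "ciprofloxacin": 0.05, "levofloxacin": 0.05},
    thresholds,
    clinician_choice="ciprofloxacin",
)
fallback = recommend(
    {drug: 0.9 for drug in SPECTRUM_ORDER}, thresholds, clinician_choice="ciprofloxacin"
)
```

In the first call, nitrofurantoin is predicted nonsusceptible, so the recommendation moves one step up the spectrum to TMP-SMX even though both fluoroquinolones are also predicted susceptible.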

Because of the extensive time it takes to evaluate the performance of each FNR combination and the high correlation between resistance to ciprofloxacin and levofloxacin, we constrained the search space to combinations in which both second-line antibiotics had the same FNR. In total, we calculated reductions in inappropriate antibiotic therapy and second-line usage over 11 FNR values (0.001, 0.015, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9), one model class (logistic regression), and three antibiotics (nitrofurantoin, TMP-SMX, and the fluoroquinolones combined), yielding 1331 different combinations. The optimal threshold set was defined to be the one that minimized the mean rate of inappropriate antibiotic therapy while not exceeding a mean second-line antibiotic usage rate of 10% on the validation set.
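The search space and the selection rule can be sketched as follows. The grid values and the 10% cap come from the text; the result-tuple layout and the function name `select_optimal` are hypothetical, and the outcome rates in the example are invented purely to exercise the rule.

```python
from itertools import product

# FNR grid and antibiotic groups as listed in the text
FNR_GRID = (0.001, 0.015, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
DRUG_GROUPS = ("nitrofurantoin", "TMP-SMX", "fluoroquinolones")

# One FNR per group: 11 ** 3 = 1331 candidate threshold sets
threshold_sets = [
    dict(zip(DRUG_GROUPS, combo)) for combo in product(FNR_GRID, repeat=3)
]

def select_optimal(results, max_second_line=0.10):
    """results: list of (threshold_set, mean_inappropriate_rate,
    mean_second_line_rate) averaged over validation splits.
    Returns the feasible entry with the lowest inappropriate-therapy rate,
    or None when no entry satisfies the second-line cap."""
    feasible = [r for r in results if r[2] <= max_second_line]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r[1])

# Invented outcome rates for three candidate sets (illustration only)
results = [
    (threshold_sets[0], 0.10, 0.08),
    (threshold_sets[1], 0.08, 0.15),  # lowest error, but exceeds the 10% cap
    (threshold_sets[2], 0.09, 0.10),
]
best = select_optimal(results)
```

The second candidate has the lowest inappropriate-therapy rate but is excluded by the second-line constraint, so the third candidate is selected.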

Retrospective evaluation

We estimated the performance of the decision algorithm on a held-out test set of patients presenting with uncomplicated UTI between 2014 and 2016. We retrained our best-performing models with tuned hyperparameters on 100% of the training data, applied the optimal threshold set (identified through our sensitivity analysis on the training data) to recommend empiric antibiotic treatments, and then calculated primary outcomes over the entire test dataset. For specimens where the model was unable to make a recommendation, the evaluation defaulted to the choice of the clinician. As our primary benchmark, we compared algorithm performance in the test dataset to the empiric treatment decisions made by clinicians.

We also sought to compare our performance to a conservative interpretation of the IDSA guidelines that preferentially chose the first-line agent to which the patient did not have prior antibiotic exposure and resistance in the prior 90 days. It avoided TMP-SMX if local rates of resistance were ≥20% and chose fluoroquinolone only if the patient had exposure or resistance to both first-line agents. In our dataset, local rates of resistance to TMP-SMX exceeded 20% every year, leading the guidelines to always favor nitrofurantoin over TMP-SMX. This implementation of the guidelines yielded a 3.2% rate of broad-spectrum usage in our validation cohort, which we deemed to be an unrealistic benchmark compared to real-world antibiotic prescribing practices (3, 4). It also ignores drug allergies or intolerance and prior treatment failures. Thus, we adjusted guideline recommendations to use second-line agents 10% of the time because this represents a more realistic target for antimicrobial stewardship programs. This was done by defaulting to broad-spectrum therapy in 18% of decisions where the conservative guidelines, but not the clinicians, chose a first-line agent. This adjusted guideline-based policy represents a best-case scenario for implementation of the guidelines in actual clinical practice.

We identified factors driving decisions for clinicians through a post hoc analysis using regularized logistic regression and manual chart review. Models included all of the covariates in our prediction models and were fit using the entire test dataset.

Feature importance characterization

A secondary aim of our study was to identify the features predictive of nonsusceptibility. On the basis of the known risk factors for antibiotic resistance, we grouped features into four mutually exclusive sets: (i) prior antibiotic exposure, (ii) prior antibiotic resistance and organism, (iii) colonization pressure, and (iv) hospital-wide antibiotic consumption. To estimate the impact of a given feature set, we compared the predictive performance between a full model and one where that set was held out. All models were trained in the same fashion as the prediction models and all contained demographics, comorbidities, laboratory values, and hospital encounters.

Statistical analyses

P values for comparisons of patient characteristics between train and test sets were calculated using two-sample t tests when a Shapiro-Wilk test failed to reject normality of the test statistic at a significance level of 0.05. Otherwise, P values were computed using a nonparametric randomization test using 100,000 random permutations of the dataset labels (i.e., train versus test set) for each characteristic. We report P values for descriptive purposes but do not assess statistical significance of these comparisons. Means and SDs for training set AUROCs were obtained by averaging over 5 70/30 splits for hyperparameter tuning and over 20 70/30 splits for model selection. For each of the 1331 threshold sets, we calculated means for primary outcomes over the same 20 70/30 splits generated for the training step. For the retrospective evaluation, we calculated mean AUROCs using 1000 bootstrapped samples drawn with replacement from the test set. Each sample contained the same number of specimens as the full test set. The 95% CIs for the reported AUROCs were calculated as follows

$$\overline{\mathrm{AUROC}_b} \pm z_{0.975} \cdot \mathrm{Stdev}(\mathrm{AUROC}_b)$$

where $\overline{\mathrm{AUROC}_b}$ and $\mathrm{Stdev}(\mathrm{AUROC}_b)$ are the mean and SD of the AUROC across the 1000 bootstrapped samples, respectively, and $z_{0.975}$ is the 0.975 quantile of the standard normal distribution, approximately equal to 1.96. The 95% CIs for our primary outcomes in the test set were calculated using a normal approximation to the binomial distribution, also known as a Wald interval. Given the value of a primary outcome $\hat{p}$, the 95% CI was given by

$$\left(\hat{p} - z_{0.975}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{0.975}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$

where $n$ is the sample size of the entire test set. For 95% CIs where sample sizes were <20, we computed the Jeffreys interval. All analyses were performed using Python version 3.6 and R version 3.5.0 (R Project for Statistical Computing).
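The two interval constructions can be sketched in a few lines. This is an illustration under stated assumptions, not the study code: the function names are hypothetical, the bootstrap is shown for a generic statistic (for an AUROC, the resampled items would be label and score pairs), and `statistics.NormalDist` requires Python 3.8 or later, which is newer than the version reported in the text.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

Z_975 = NormalDist().inv_cdf(0.975)  # 0.975 quantile of the standard normal

def wald_interval(p_hat, n):
    """Normal-approximation (Wald) 95% CI for a proportion."""
    half = Z_975 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def bootstrap_ci(values, statistic, n_boot=1000, seed=0):
    """95% CI as mean +/- z * SD of `statistic` over bootstrap samples,
    each drawn with replacement and the same size as `values`."""
    rng = random.Random(seed)
    n = len(values)
    stats = [
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    m, s = mean(stats), stdev(stats)
    return m - Z_975 * s, m + Z_975 * s

# Hypothetical outcome rate of 10% over the 3941 test-set specimens
lo, hi = wald_interval(0.10, 3941)

# Generic bootstrap example on toy data, using the sample mean as the statistic
b_lo, b_hi = bootstrap_ci(list(range(100)), mean)
```

The Wald interval is known to behave poorly for small samples or proportions near 0 or 1, which is consistent with the switch to the Jeffreys interval for sample sizes below 20.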


Fig. S1. Secular trends for specimen location and empiric treatment choices.

Fig. S2. Test set ROC curves for prediction of antibiotic susceptibility for patients presenting with uncomplicated UTI between 2014 and 2016.

Fig. S3. Calibration plots for model predictions on test dataset.

Fig. S4. Cohort selection.

Fig. S5. Detailed methods schematic.

Table S1. Final model hyperparameters.

Table S2. Detailed breakdown of primary outcomes.

Table S3. Top 10 features predicting clinician fluoroquinolone use.

Table S4. Top 10 features predicting nonsusceptibility to nitrofurantoin, TMP-SMX, ciprofloxacin, and levofloxacin.

Table S5. Description of the Boston Infectious Diseases Cohort.

Data file S1. Data for all graphs.


Acknowledgments: We would like to thank Y. Grad and M. Klompas for their close review of the manuscript. Funding: This work was supported by a Massachusetts General Hospital–Massachusetts Institute of Technology Grand Challenges Award (S.K., M.O., S.B., and H.Z.), a Harvard Catalyst CMeRIT grant (S.K.), and a National Science Foundation CAREER award (S.B. and D.S.). Author contributions: S.K. and D.S. conceived the study. S.K., M.O., S.B., and H.Z. were responsible for collecting and cleaning the data. M.O., S.B., and H.Z. built the database to store and retrieve the data and developed the machine learning models. S.K. wrote the manuscript with input from all authors. D.C.H. provided mentorship and clinical insight for the data analysis. All authors reviewed multiple versions of the manuscript. D.S. supervised the project. Competing interests: S.K. receives grant support from PhAST Diagnostics. D.S. is a paid consultant for ASAPP, Memorial Sloan Kettering Cancer Center, Curai, and GNS Healthcare. H.Z. is a part-time student researcher at Google Health. All other authors declare that they have no competing interests. Data and materials availability: All data associated with this study are in the paper or the Supplementary Materials. A deidentified version of the patient-level data used for this study is made available on PhysioNet ( The code necessary to reproduce the main analyses from this paper and a link to the deidentified patient data are available at
