Biobanks and Electronic Medical Records: Enabling Cost-Effective Research


Science Translational Medicine  30 Apr 2014:
Vol. 6, Issue 234, pp. 234cm3
DOI: 10.1126/scitranslmed.3008604


The use of electronic medical record data linked to biological specimens in health care settings is expected to enable cost-effective and rapid genomic analyses. Here, we present a model that highlights potential advantages for genomic discovery and describe the operational infrastructure that facilitated multiple simultaneous discovery efforts.

Traditional studies of drug efficacy and safety address the utility of a specific therapeutic intervention in a defined population. Such study designs present important challenges. Patient accrual can take months to years, and the potential exists for systematic exclusion of clinically complicated but relevant patient groups, such as the elderly, those with comorbid conditions, and those who routinely take multiple drugs. Patient cohorts can be inadequate in size for subgroup analysis, long-term follow-up is often not feasible, and results are limited to diseases for which the participants were originally assessed. Hypothesis-neutral cohorts such as the Framingham Heart Study and Multicenter AIDS Cohort Study (MACS) have overcome these challenges and provided the foundation for critical discoveries that continue to shape health care practice. However, large monetary, time, and infrastructure investments are required to establish and maintain these highly curated, large cohorts in which data collection is focused on hypotheses formulated at the outset.

An alternative to clinical studies with traditional patient cohorts has emerged in the last decade—the pairing of disease-agnostic biobank specimens with electronic medical records (EMRs). Here, we describe the Vanderbilt Electronic Systems for Pharmacogenomic Assessment (VESPA) Project—a large EMR- and biobank-based initiative for translational pharmacogenomic discoveries. We used data from BioVU, Vanderbilt University’s EMR-linked biorepository (which as of April 2014 contains more than 179,000 DNA samples) to perform a preliminary cost and time analysis for this approach and compared these costs and time investments with those of traditional cohort studies.


A key element in establishing an efficient and effective pipeline was the creation of an organizational structure to facilitate communication and management among research teams. Through VESPA, we developed strategies and methods for initiating, executing, and monitoring studies. Essential to this pipeline was the formation of teams for phenotyping and genetic data analysis. Phenotype teams were physician-led and composed of individuals with clinical and informatics expertise, including specific clinical domain content experts. These experts were responsible for cohort selection, algorithm development and refinement, and manual review when necessary. The genetic data–analysis team, which had expertise in laboratory techniques and genomics technologies, directed genotyping assays and interacted with each of the phenotype teams. Project managers participated in study design, managed both phenotype development and genotyping throughput, and tracked timelines and milestones; this management tier was crucial for promoting multiple, simultaneous studies at different stages of development or execution.

The phenotype pipeline consisted of five key components: selection of a study phenotype, study design, and phenotype-specific algorithm development, review, and implementation. Study hypotheses were divided into two categories: (i) validation studies, which replicated the association of clinical outcomes (for example, drug-response phenotypes) with previously identified genomic variants, and (ii) discovery studies, which were genome-wide investigations that sought to identify new gene-phenotype associations. A total of 28 phenotypes were selected for study (table S1).

Development of phenotype algorithm. Recent efforts have examined the utility of algorithms for determining phenotypes from EMRs (1–3). We used two approaches to construct phenotype algorithms: (i) fully automated, through the use of phenotype-selection algorithms that achieved high precision, and (ii) semi-automated, using algorithms to select a set of cases for manual review (usually rarer phenotypes). Data sets required to identify cases and controls accurately for each phenotype varied, but most included three data types: ICD-9 codes, medication regimens, and medical test results. Ten of the phenotypes also required the use of advanced informatics methods, such as natural language processing, to extract information stored in unstructured clinical text.
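As a minimal sketch of how a fully automated rule might combine the three data types named above, the Python below labels a record as a case only when a diagnosis code, a medication exposure, and a laboratory value all agree. The record layout, the code set, the drug, the test name, and the threshold are hypothetical placeholders chosen for illustration; they are not the VESPA algorithms.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one de-identified EMR record."""
    icd9_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    lab_results: dict = field(default_factory=dict)  # test name -> peak value


def classify(record, dx_codes, drug, lab_name, lab_threshold):
    """Return 'case', 'control', or 'excluded' for one record.

    case:     exposed to the drug, has a qualifying diagnosis code,
              and the lab value is at or above the threshold
    control:  exposed to the drug, has no qualifying code,
              and the lab value was measured and is below the threshold
    excluded: everything else (unexposed or ambiguous records, which
              would be candidates for manual review)
    """
    if drug not in record.medications:
        return "excluded"
    has_dx = bool(record.icd9_codes & dx_codes)
    lab = record.lab_results.get(lab_name)
    if lab is None:
        return "excluded"
    if has_dx and lab >= lab_threshold:
        return "case"
    if not has_dx and lab < lab_threshold:
        return "control"
    return "excluded"


# Illustrative values only; the code, drug, and threshold are assumptions.
example = PatientRecord({"729.1"}, {"simvastatin"}, {"creatine_kinase": 1500.0})
label = classify(example, {"729.1"}, "simvastatin", "creatine_kinase", 1000.0)
```

Requiring agreement across data types trades recall for precision, which suits a pipeline that prefers to send ambiguous records to manual review rather than risk misclassified cases.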

Pharmacogenomic phenotypes, in particular, rely heavily on temporal relationships (for example, administration of simvastatin before or concurrent with the onset of muscle pain). For our phenotype algorithms, we used event-sequence analyses to establish temporal relationships between drugs and phenotypes, which is a substantial challenge in bioinformatics (4). Both our case and control algorithms excluded records that contained specific clinical comorbidities. Algorithms were quality checked for precision by team members and iteratively refined to achieve positive predictive values (PPV) > 90%. For automated algorithms failing to meet this threshold, manual review was coupled with algorithms to validate that the included cases were true positives (5). Although manual review can be time-consuming and impractical for large cohorts, it is warranted when phenotypes are rare, complex, or involve temporal components too difficult to define electronically.
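The two checks described above, the temporal ordering of drug exposure and outcome and the precision of an algorithm against manual review, can be illustrated with a short Python sketch. The function names, the optional grace window, and the review counts are assumptions made for this example; only the 90% PPV threshold comes from the text.

```python
from datetime import date, timedelta


def drug_precedes_event(drug_start, drug_stop, event_date, grace_days=0):
    """True if the event falls on or after the start of drug exposure and
    no later than the stop date plus an optional grace window.

    drug_stop may be None for an exposure that is still ongoing.
    """
    if event_date < drug_start:
        return False
    if drug_stop is None:
        return True
    return event_date <= drug_stop + timedelta(days=grace_days)


def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP), computed over algorithm-selected cases
    that were checked by manual chart review."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed cases")
    return true_positives / reviewed


PPV_THRESHOLD = 0.90  # threshold stated in the text

# Hypothetical review: 50 algorithm-selected cases, 47 confirmed on review.
ppv = positive_predictive_value(47, 3)
needs_manual_review = not (ppv > PPV_THRESHOLD)
```

In this hypothetical review the PPV is 0.94, so the algorithm would pass; a lower value would route the phenotype to the semi-automated path with manual confirmation of cases.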

Enabling overlap. A total of 11,639 subjects (Table 1) met phenotyping criteria for at least one of the 28 phenotypes investigated by the VESPA team. Cohorts included subjects with primarily drug-response phenotypes. Seven phenotypes were not explicitly designed as such but were intended to enable future investigation into potential drug-response phenotypes; for example, subjects exposed to immunosuppressant therapy after organ transplantation offer potential examination of a range of outcomes (drug levels, transplant rejection, lipid abnormalities, cancer, or infections). Across all phenotype cases and controls, 90% were reused as either a case or control for at least one other phenotype. This demonstrates the capability offered by EMR-based studies to reuse cases and controls across both rare and common phenotypes, each with different phenotyping processes. Two VESPA replication studies, one of major adverse cardiac events with clopidogrel and one of warfarin stable dose, have established the validity of an EMR-based method for identifying pharmacogenomic associations (5, 6).
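The reuse figure above is a simple set computation once each phenotype cohort is expressed as a set of subject identifiers. The sketch below shows one way to compute it; the function name and the toy cohorts are hypothetical and do not reproduce VESPA data.

```python
from collections import Counter


def reuse_fraction(cohorts):
    """Fraction of unique subjects who appear in two or more cohorts.

    cohorts: dict mapping phenotype name -> set of subject IDs
             (cases and controls combined for that phenotype).
    """
    counts = Counter(
        subject for members in cohorts.values() for subject in set(members)
    )
    if not counts:
        return 0.0
    reused = sum(1 for n in counts.values() if n >= 2)
    return reused / len(counts)


# Toy example: subjects 2 and 3 are shared, so 2 of 5 subjects are reused.
toy_cohorts = {
    "phenotype_a": {1, 2, 3},
    "phenotype_b": {2, 3, 4},
    "phenotype_c": {3, 5},
}
fraction = reuse_fraction(toy_cohorts)
```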

Table 1


We compared the estimated monetary cost and resources required to generate VESPA cohorts (excluding analysis) to cost estimates drawn from the analysis of data derived from the NIH RePORTER (7) for M-, R-, U-, P- and Z-type grants that directly supported discrete pharmacogenomics studies in humans. Our analysis (Table 2, legend) revealed striking savings with the multiplexed VESPA approach (Table 2 and Fig. 1). The VESPA experience resulted in 28 case-control sets with a median cost per study of $76,674 [interquartile range (IQR), $43,173 to $207,769] and a median cost per genotyped subject of $393 (IQR, $382–$465). This includes the cost to phenotype cases and controls (personnel resources required to develop algorithms, implement algorithms, extract data, review records, and manage the pipeline) as well as the cost to genotype the cohort (consumables, processing, and quality control).

Table 2
Fig. 1 Time is money.

Comparison of traditional NIH-funded pharmacogenomic studies versus EMR/biobank studies (BioVU). (Left) Median cost of study per subject. (Right) Median length of study in years.


The median funding amount for pharmacogenomics-related NIH grants with defined cohort sizes (across their lifetimes) is $1,335,927, with a median cost per genotyped subject of $1419. Notably, the low median cost per VESPA study ($76,674) was enabled by the reuse of subjects as cases and controls across multiple studies; had studies been conducted in isolation with no overlap among cases and controls, the estimated median cost per study would have been $438,473. Further highlighting the efficiency of biobank studies, VESPA studies took a median of 3 months to identify subjects with the target phenotypes, whereas the NIH grants reviewed were awarded for a median period of 3 years. Indeed, traditional consented recruitment models, for example, for common cancers, can take up to 20 years to generate sufficient cohort sizes (8). VESPA studies did not sacrifice cohort size or power as a consequence of reduced cost; in fact, the median cohort size of VESPA phenotypes was 1123, which is almost twice that of NIH-funded pharmacogenomics studies, which had a median cohort size of 623. Compared with a median cost per subject per year of $478 in a traditional cohort study, the median cost per subject per year in a VESPA study was $96.
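To make the scale of these differences explicit, the following Python sketch computes fold differences from the medians reported in the text. The variable names are ours, and because each figure is a median, the ratios are only rough summaries (a ratio of medians is not the median of per-study ratios).

```python
# Medians as reported in the text (US dollars, except cohort size).
vespa = {
    "cost_per_study": 76_674,
    "cost_per_subject": 393,
    "cost_per_subject_year": 96,
    "cohort_size": 1123,
}
nih = {
    "cost_per_study": 1_335_927,
    "cost_per_subject": 1419,
    "cost_per_subject_year": 478,
    "cohort_size": 623,
}
vespa_cost_per_study_without_reuse = 438_473

# Fold difference, NIH-funded studies relative to VESPA, for each cost measure.
cost_fold = {
    key: nih[key] / vespa[key]
    for key in ("cost_per_study", "cost_per_subject", "cost_per_subject_year")
}

# Cohort size is compared in the other direction (VESPA relative to NIH).
cohort_fold = vespa["cohort_size"] / nih["cohort_size"]

# Effect of subject reuse alone on the median VESPA cost per study.
reuse_fold = vespa_cost_per_study_without_reuse / vespa["cost_per_study"]

for key, value in cost_fold.items():
    print(f"{key}: {value:.1f}x")
print(f"cohort_size (VESPA/NIH): {cohort_fold:.1f}x")
print(f"cost per study without reuse vs. with reuse: {reuse_fold:.1f}x")
```

On these figures the median cost per study differs by roughly 17-fold, the cost per genotyped subject by about 3.6-fold, and the cost per subject per year by about 5-fold, while reuse of subjects accounts for a nearly 6-fold reduction in the median VESPA cost per study.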


There are potential advantages of discovery efforts in an EMR environment, especially when coupled to large genomic resources. First, EMRs contain large patient populations without disease-based exclusions (8). As demonstrated by the EMRs and genomics (eMERGE) network, a U.S. national consortium of existing DNA biorepositories linked to EMRs, these data can be used to rapidly create large, inclusive patient cohorts that foster investigation of variability in physiological traits and disease susceptibility (9–11). Second, the EMR approach offers substantial efficiencies owing to the ability to examine multiple phenotypes by using a single cohort of genotyped samples, an idea first championed on a large scale by the Wellcome Trust Case Control Consortium (12). Third, biobanks enable access, not only to cases but also to large numbers of controls, potentially providing additional power when using a design based on multiple controls per case. Fourth, because EMR-based biobank research is coupled to data routinely obtained in clinical care, the efficiencies of reuse suggest that the approach will prove to be cost-effective. In addition, the increasing use of EMRs [incentivized by the U.S. Health Information Technology for Economic and Clinical Health (HITECH) Act] and the increasing number of EMR-linked biobanks worldwide offer cost-effective resources, not only for discovery but also for the replication of genomic associations across nations and ancestries.

BioVU, the Vanderbilt DNA databank, is an example of an EMR-linked biorepository and a component of eMERGE (13, 14). It is important to note that the total costs described here for the VESPA study are marginal costs; they do not include costs associated with the design, set-up, and building of BioVU or with establishing and maintaining the clinical electronic medical record. Thus, the substantial cost savings we observed were facilitated by resources already in place. Development of BioVU, an evolving resource with longitudinal health information, was and is institutionally supported, including investment in EMRs and creation of de-identified images of the EMRs. We highlight the cost savings enabled by BioVU to demonstrate the considerable return on investment afforded by the development of an EMR-based biobank.

As we have demonstrated, EMR-based biobanks can be cost-effective tools for establishing disease or drug associations in a real-world community health care setting. We provide data here that an EMR-linked biobank model such as BioVU enables cost and time efficiency in multiple ways: (i) the use of biological samples that have already been collected and would otherwise be discarded; (ii) an economy of scale obtained by central processing of these samples; (iii) reuse of the same sample for multiple studies without incremental collection, extraction, or processing costs; (iv) centralized de-identification and phenotype annotation of the EMR; and (v) reuse of data, based on program requirements for redeposit of genetic data for all studies. This efficiency is reflected in the substantial cost savings over traditional methods and is further amplified by the ability to examine multiple phenotypes by using a single cohort of genotyped samples (12).

Growth in EMR adoption fostered by the HITECH Act provides the foundation to efficiently expand EMR-based research and is not limited to studies within a single medical center. As evidenced by the robust analyses enabled by the eMERGE network (15–17), the utility of EMR-derived data linked to biological specimens is amplified by pooling analyses across networks, leading to an increase in sample sizes and minimization of biases (18). The eMERGE network has demonstrated successful sharing of more than 18 phenotype algorithms across sites, with a median of three external validations per algorithm. Performance of case and control algorithms in development-site evaluations was similar to that in external-site evaluations: Median case PPV was 97% for host evaluations, and median PPV for external-site evaluations was nearly identical at 95.5%, establishing portability of electronic definitions regardless of the EMR system and interoperability (


Data reuse. When combining data from multiple studies in a redeposit design such as that of BioVU, a major challenge is the combining of genotyping data ascertained from different genotyping platforms. This presents challenges for genetic analyses, including the selection of variants for analysis and controlling for batch and platform effects. However, these challenges are not unlike those associated with large genome-wide association study (GWAS) meta-analyses (18–20). Indeed, a key analytical approach for VESPA studies has been to use GWASs, similar to the approach of many traditional pharmacogenomic studies that rely on observational cohorts, subject enrollment, or randomized controlled trials.
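One simple way to picture the two challenges named above is to restrict analysis to variants assayed on every platform and to carry the platform along as a covariate. The Python sketch below does only that, under stated assumptions: the function names and inputs are hypothetical, and it does not represent the VESPA analysis pipeline, which may have used other strategies such as imputation.

```python
def shared_variants(platform_variants):
    """Variants present on every genotyping platform.

    platform_variants: dict mapping platform name -> set of variant IDs.
    Returns a sorted list so that downstream ordering is reproducible.
    """
    sets = [set(v) for v in platform_variants.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def platform_covariates(sample_platform):
    """Indicator (dummy) coding of platform for each sample.

    sample_platform: dict mapping sample ID -> platform name.
    The alphabetically first platform is the reference level and gets
    all zeros, which avoids collinearity with a model intercept.
    """
    levels = sorted(set(sample_platform.values()))[1:]
    return {
        sample: [1 if platform == level else 0 for level in levels]
        for sample, platform in sample_platform.items()
    }


# Toy inputs for illustration only.
variants = shared_variants({
    "platform_A": {"rs1", "rs2", "rs3"},
    "platform_B": {"rs2", "rs3", "rs4"},
})
covariates = platform_covariates({"s1": "platform_A", "s2": "platform_B"})
```

Intersecting variant sets is conservative and discards data; it is shown here only because it makes the variant-selection problem concrete.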

Although the GWAS method has been highly successful in identifying new loci associated with disease susceptibility, it has also been criticized because the effect sizes of the identified loci are often small, and thus, very large cohorts are needed to identify and validate genomic variations. On the other hand, although GWAS for drug response traits is less well-explored, multiple studies support the hypothesis that genetic associations can be identified even with small cohort sizes (21–23). Unlike most disease-susceptibility studies, the effect sizes in pharmacogenomics can be large enough to consider for implementation in clinical care. As such, biobanks may become a crucial tool for facilitating pharmacogenomics research. Although we primarily focus on drug-response phenotypes, the methods described here can be used for a wide range of EMR-derived phenotypes or even to inform phenome-wide analyses (24).

EMR biases. Despite their numerous benefits related to time and efficiency, EMR-linked biobank approaches have limitations (table S2). One fundamental limitation is the potential loss to follow-up or the absence of clinical information pertaining to a patient after a given point in time. In the specific case of BioVU, de-identification of all subjects formally eliminates the ability to recontact patients. Moreover, the data are collected as a result of a provider’s determination of need based on clinical relevance at the time and may include only those medical encounters within one given medical center. Thus, studies are limited to, and potentially biased by, data that are available in the EMRs. In addition, it can be challenging to accurately identify cases and controls, particularly for complex phenotypes, and exposure misclassification or selection effect can lead to bias in the estimation of an interaction effect (20, 25).

In our studies, cohorts were defined by an exposure to a medication, a procedure, or patient characteristics at an index point in time; determining cases and controls by temporally constrained definitions can limit cohort populations because of the inherent difficulties in establishing temporality and event sequence in EMR records (26). Moreover, EMR-based data do not inherently capture the cost of a procedure or clinical event. However, an EMR system could be expanded and linked to external data sources, including cost and systems-delivery data, enabling such studies and affording additional opportunities for linking to research-derived data.

Politics. The trend of reduced U.S. federal support for research (27) jeopardizes higher-priced scientific explorations, even those that have proven fruitful for science and health. The current funding climate, rising costs of health care R&D, and stricter payer requirements should make resource reuse increasingly important for advancing clinical and translational research as well as for reducing related health care costs.

The financial efficiencies we observed for the EMR approach make it a compelling complement to traditional cohort designs.




Author contributions

Table S1. Summary of phenotypes.

Table S2. Advantages and disadvantages of the EMR-based biobank approach.

References (28–40)


Competing interests: The authors declare that they have no competing interests.