Commentary: Data Sharing

To Share or Not To Share: That Is Not the Question


Science Translational Medicine  19 Dec 2012:
Vol. 4, Issue 165, pp. 165cm15
DOI: 10.1126/scitranslmed.3004454

Abstract

There is an increasing awareness of the power of integrating multiple sources of data to accelerate biomedical discoveries. Some even argue that it is unethical not to share data that could be used for the public good. However, the challenges involved in sharing clinical and biomedical data are seldom discussed. I briefly review some of these challenges and provide an overview of how they are being addressed by the scientific community.

THE DATA DELUGE

Just a few decades ago, the idea that health sciences researchers across the globe could share their latest data and knowledge electronically, within seconds, was merely a futuristic vision. At that time, technology, computer models, and cultural differences were not sufficiently understood to advance data sharing in biomedical research. But this vision materialized when collaborations and broad sharing networks were formed during the Human Genome Project and in creating PubMed, for example. Today, there is little question that responsible data sharing is a necessity to advance science (1)—and also a moral obligation (2). Many patients are willing to share their data if they are properly consulted (3). The question is no longer whether or not to share data for research, but how to do it in a way that adds value to society and protects individual privacy and preferences.

Although much has been written about handling the data deluge in health care and biomedical science (4, 5), practical solutions are still maturing. The increasing adoption of electronic health records, integrated laboratory information systems, online social networks, and high-throughput technologies has spurred the interest of government, industry, and academia in “Big Data” (6). As their goals align, important questions emerge: Where will these data reside? How will they be organized? Who should have access? Who will pay for the infrastructure for storing, sharing, and analyzing data? It is not yet clear whether biomedical researchers and health-care providers can stay focused on their core activities by outsourcing data storage, deidentification, annotation, curation, and distribution to reliable third parties that possess the appropriate expertise. Moving forward, it is desirable that these data be integrated so that they provide more value than if shared for only one purpose at a time.

CARE TO SHARE?

On the biomedical front, many researchers are faced with a common challenge. They carefully collect data and use them until results are published but do not have the means to properly maintain the data or software long term and/or prepare them for sharing with other scientists. Most research journals are not equipped to review and maintain large annotated databases or software, and small research groups may not have the resources to maintain data, metadata (information about the data that would help others reuse them), and software developed primarily for in-house use. Access to data generated elsewhere is also difficult. Although public repositories exist, many types of data are not represented, so there is no “home” for them.

Funding agencies and journals increasingly demand that researchers share their data and software with others to ensure reproducibility of results as well as to support new analyses, but the task of developing and maintaining infrastructure to accomplish this demand is hard, time-consuming, and risky because of its high cost and the difficulties in recruiting specialized personnel. The currently used model—peer-to-peer data exchange—quickly becomes intractable when there are a multitude of unknown requesters. Reproducing the results of others is difficult because the same software environment needs to be constructed, which often requires extensive installation and configuration of different versions of software components.

On the health-care front, there are intriguing parallels. Data collected in the process of care could prove extremely useful for quality improvement initiatives as well as clinical research beyond the source institutions (7). Besides addressing the lack of data standardization across different institutions, several steps need to be taken to make derivatives of these data available to others in a way that protects individual and institutional privacy and that ensures data quality. Even when exchanged for care, electronic data from patient records require special protections and a corresponding policy framework that ensures proper consent and compliance with regulations that cut across institutional, state, federal, and international boundaries.

To responsibly exchange these data for research is even more daunting. The vision of a “learning health-care system” in which all these data can be used for quality improvement and for health services or patient-centered outcomes research is sensible, but enabling this system is not simple (8). Furthermore, it is not trivial to track data usage to address the public’s increasing interest in guarding their own records (9) and in understanding how data and specimens obtained for one purpose (for example, health care or a specific study) are being used for other purposes, such as secondary analyses in other studies (10). Because of the high stakes, not all health-care organizations have bought into the idea of data sharing; beyond the technical challenges, prerequisites such as a system of incentives and a clear business model have to be developed. As research becomes increasingly translational, it is important that these challenges start to be addressed in a systematic way.

Awareness of this data-sharing challenge has prompted different institutions, including the National Institutes of Health (NIH), the Institute of Medicine (IOM), the Agency for Healthcare Research and Quality (AHRQ), and the Patient-Centered Outcomes Research Institute (PCORI) in the United States, as well as international agencies, to assemble experts to discuss current best practices and new models for sharing diverse data, such as whole genomes, images, and structured data items commonly found in electronic health records. For example, the NIH Working Group on Data and Informatics (http://acd.od.nih.gov/diwg.htm) has made important recommendations to the Advisory Committee to the NIH Director. To summarize their recommendations, data and metadata should be shared, incentives should be offered to those who share data, and investments in user training and infrastructure need to be coordinated to ensure efficient use of resources. On the training side, the number and size of programs to train informatics professionals and researchers need to increase. On the infrastructure side, a backbone for data and software sharing needs to be implemented—for example, through a network of biomedical computing centers. Building this network in a rapidly evolving technological landscape will require the development of new models for data sharing.

STATE OF TECHNOLOGY

Although initial technical setup may be complex, different solutions currently exist that allow researchers to share health and biomedical research data that involve human subjects in a privacy-preserving manner. “Cloud” computing has presented new ways in which to build and deliver software, and cloud storage has become mainstream in the digital world (such as Amazon Cloud Drive and Apple iCloud). Cloud-based initiatives are part of an architectural solution that allows researchers to outsource infrastructure and use resources “on demand.” This power to scale a computational resource comes at a cost: Economies of scale are achieved by having multiple users use the resources of the cloud, which increases the complexity of managing the security and confidentiality of the data. In general, the requirements for human-subject data protection are not completely resolved by commercial, public cloud providers (11). To handle protected health information, these entities would need to sign business associate agreements with the data-contributing institutions; some cloud providers are not yet ready for this responsibility. Human genomes contain biometric information and hence can be considered protected health information, which creates a problem when using public clouds.

Fortunately, many privacy-protection algorithms and policy frameworks are being developed. Policies and technology can protect privacy in the cloud, particularly for specialized solutions that can be implemented in “private” and certain “community” clouds. Research clouds hosting protected health information must place a strong emphasis on privacy protection: the advantages of elastic computing (the provision of on-demand computational resources for a large number of users) should still hold, but the environments handling protected health information must be segregated, responsibilities clearly spelled out, and additional access and quality assurance mechanisms implemented.

Different models for data sharing in a research community cloud are currently being investigated—for example, in the Johns Hopkins Institute for NanoBioTechnology (http://releases.jhu.edu/2012/11/06/collecting-cancer-data-in-the-cloud/) and in the iDASH National Center for Biomedical Computing (12). iDASH stands for “integrating data for analysis, anonymization, and sharing” and is one of six centers funded in 2010 by the roadmap initiative at the NIH. The focus of this initiative is on new models for data sharing that allow researchers and institutions to pass the responsibility of data sharing, computing, and storage of large amounts of protected health information to a third party. Through technology and policy innovation, centers such as these address different models of data sharing, as illustrated in Fig. 1 and briefly described below.

Fig. 1.

Data-sharing models. To avoid multiple pairwise agreements among institutions, a broker for data can be created. Data contributors specify their requirements for data access by users and sign a contributor data use agreement (DUA). Completing a quality assurance (QA) process is required for data, tool, and VM contributions. Data users also sign a DUA that complies with the requirements of the contributor, so that contributors do not have to negotiate every data-sharing engagement with different institutions. Three models of sharing are displayed. Model 1 is the traditional model, in which users download data for use in their local computers. In model 2, the remote desktop model, users connect to a center but access and analyze data within the center using existing, or their own, algorithms. Model 3 involves virtualization and distributed computation, in which users import software environments (virtual machines, or VMs) to analyze their data using their local computational infrastructures.

CREDIT: Y. HAMMOND/SCIENCE TRANSLATIONAL MEDICINE

Users download data. In this traditional model (shown as model 1 in Fig. 1), data-seekers identify relevant data sources in a distributed or centralized resource (such as a server) and download data to their local computers. However, as data become “big” (giga- to petabytes of data) and issues related to frequency of updates, available network bandwidth, and ascertainment of data provenance (if these data are further distributed) become more common, it is not always practical or desirable to have data downloaded to local computers. This model, although still highly prevalent in the scientific community, may not work ideally in the long term. The liabilities involved on the part of data donors and users are high, and gigabit networks are still limited to certain institutions. Although deidentification and privacy-protection algorithms can mitigate the confidentiality problem (13–17), once data are downloaded there is no way to track their use, and there is still some risk of reidentification (18).
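
Although none of the cited algorithms (13–18) is reproduced here, the minimal Python sketch below illustrates one common screening step a contributor might run before releasing a “deidentified” table: a k-anonymity check over quasi-identifier columns. The column names and the threshold are assumptions made for illustration only.

```python
# Minimal sketch: screening a "deidentified" table with a k-anonymity check.
# Quasi-identifier column names and the threshold k are hypothetical.
import pandas as pd

QUASI_IDENTIFIERS = ["zip3", "birth_year", "sex"]  # assumed quasi-identifiers

def smallest_equivalence_class(df: pd.DataFrame, quasi_ids=QUASI_IDENTIFIERS) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier
    values; the table is k-anonymous for k equal to this size."""
    return int(df.groupby(quasi_ids).size().min())

def is_k_anonymous(df: pd.DataFrame, k: int = 5) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_equivalence_class(df) >= k

if __name__ == "__main__":
    # Toy example: three records share one quasi-identifier combination,
    # one record is unique, so the table is only 1-anonymous.
    records = pd.DataFrame({
        "zip3": ["921", "921", "921", "940"],
        "birth_year": [1970, 1970, 1970, 1985],
        "sex": ["F", "F", "F", "M"],
        "ldl_mg_dl": [130, 142, 118, 155],
    })
    print("k =", smallest_equivalence_class(records))
    print("5-anonymous?", is_k_anonymous(records, k=5))
```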

More research in reidentification and quantification of risk for privacy breaches will help develop policies for this model of sharing, particularly if information about human genomes is going to be shared (19). In fact, the recent NIH Workshop “Establishing a Central Resource of Data from Genome Sequencing Projects” (http://www.genome.gov/27549169) recommended that “sequence/phenotype/exposure data sets [be] deposited in one or several central databases.” In addition to recommending a central location for such data, the meeting discussions stressed the development of governance methods and policies for central databases that support responsible access to individual data sets. The U.S. Presidential Bioethics Advisory Committee recently issued a report emphasizing the importance of protecting health information, particularly the data about an individual’s genome (20).

An important feature of centralizing data is the ability to keep harmonized collections for future use. Manipulation of data, such as harmonization across different data sets (preprocessing), may result in products that are as useful as—if not more useful than—the original data. Because preprocessing is executed on local computers and there is no easy mechanism to upload the preprocessed data back to the collective data resource, the user-download model usually provides only one-way resource sharing. Participants of the NIH Workshop recommended that retrospective harmonization of data be captured in the central databases. Despite some limitations, this data-sharing mechanism is well understood by researchers and institutions and is still practical for small, nonsensitive (“sanitized”) data that are not requested with very high frequency.
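
As a small illustration of the kind of preprocessing at stake, the sketch below harmonizes a laboratory value reported in different units by two hypothetical contributing sites before the combined table is deposited. The column names, units, and conversion factor are assumptions for illustration, not part of any cited workflow.

```python
# Minimal sketch of harmonizing a lab value across two contributing sites.
# Column names, units, and the glucose conversion factor are illustrative.
import pandas as pd

MGDL_PER_MMOLL = 18.0  # approximate conversion factor for glucose

def to_mg_dl(df: pd.DataFrame, value_col: str, unit_col: str) -> pd.Series:
    """Convert glucose measurements to mg/dL regardless of reported unit."""
    factor = df[unit_col].map({"mg/dL": 1.0, "mmol/L": MGDL_PER_MMOLL})
    return df[value_col] * factor

# Two hypothetical sites reporting in different units
site_a = pd.DataFrame({"glucose": [99, 126], "unit": ["mg/dL", "mg/dL"]})
site_b = pd.DataFrame({"glucose": [5.5, 7.0], "unit": ["mmol/L", "mmol/L"]})

combined = pd.concat([site_a, site_b], ignore_index=True)
combined["glucose_mg_dl"] = to_mg_dl(combined, "glucose", "unit")
print(combined)
```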

Users access and analyze data remotely. In this model (shown as model 2 in Fig. 1), no data are downloaded; they remain protected in centralized or distributed data sets. Users can perform analyses using preexisting software (located where the data reside) or submit their own software. Given the need to protect privacy, submitted software undergoes a specialized quality assurance process to ensure that no data are leaked with the results of the computation. Although this model requires users to be connected to the Internet, liabilities are reduced in the case of lost or stolen computers. The environment is admittedly somewhat less flexible than the data-download model, but it can be privacy-protected and can offer computational resources that may not be available to the data user at his or her own institution. It also offers auditing capabilities to the data-hosting center that are not possible when data are downloaded by the users.

A variety of operating systems, applications, and data sets are required to support this model because it is hard to predict what users will need. Besides protecting the data, this model is especially useful if the user is going to perform demanding operations on large, sensitive data sets, such as genome queries. For example, if the researcher wants to perform de novo assembly of a large genome such as that of an individual patient, but does not have the computational infrastructure, she could use this model to compute “in the cloud.”
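
A minimal sketch of how a user might interact with such a center is shown below: an analysis script is submitted to run next to the data, and only results that pass the center’s quality assurance review come back. The endpoint URLs, payload fields, and credential handling are hypothetical placeholders, not the interface of any specific center named in this article.

```python
# Minimal sketch of the remote-analysis model (model 2): the user never
# downloads data; a script is submitted to a hypothetical data-hosting
# center, and only vetted results are returned. All endpoints, fields,
# and credentials below are invented for illustration.
import time
import requests

CENTER_URL = "https://datacenter.example.org/api"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <access-token>"}  # placeholder credential

def submit_analysis(script_path: str, dataset_id: str) -> str:
    """Upload an analysis script to run where the data reside; returns a job ID."""
    with open(script_path, "rb") as f:
        resp = requests.post(
            f"{CENTER_URL}/jobs",
            headers=HEADERS,
            files={"script": f},
            data={"dataset": dataset_id},
        )
    resp.raise_for_status()
    return resp.json()["job_id"]

def wait_for_results(job_id: str, poll_seconds: int = 30) -> dict:
    """Poll until the center's QA process approves and releases the results."""
    while True:
        resp = requests.get(f"{CENTER_URL}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()
        if status["state"] in ("completed", "rejected"):
            return status
        time.sleep(poll_seconds)
```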

Users import whole software environments. This model (shown as model 3 in Fig. 1) is similar to the remote access model above, except that instead of using external computational resources, users download virtual machines (VMs) to compute on their own hardware with their local data. The VM import model can also enable distributed computation, with each party installing the same VM and contributing results of its local computation to a coordinating center. For example, we have shown that it is possible to create an accurate predictive model by exporting the computation to different centers and aggregating results only, without any individual patient data ever being transferred (21).
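
The sketch below illustrates the general principle under simple assumptions: each site runs the same code inside its copy of the VM, computes aggregate sufficient statistics for a linear model, and shares only those aggregates with a coordinating center that solves for the pooled coefficients. It is an illustration of the idea of aggregating results only, not a reproduction of the algorithm in (21); site counts and data are simulated.

```python
# Minimal sketch of distributed model fitting (model 3): sites share only
# aggregate sufficient statistics (X'X and X'y); no patient-level rows leave
# any institution. Simulated data; not the algorithm of reference (21).
import numpy as np

def local_sufficient_stats(X: np.ndarray, y: np.ndarray):
    """Computed inside each institution; only these aggregates are shared."""
    return X.T @ X, X.T @ y

def pooled_linear_model(stats):
    """Run at the coordinating center on the aggregates alone."""
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    return np.linalg.solve(xtx, xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_beta = np.array([1.5, -2.0, 0.5])
    sites = []
    for _ in range(3):  # three hypothetical institutions
        X = rng.normal(size=(100, 3))
        y = X @ true_beta + rng.normal(scale=0.1, size=100)
        sites.append(local_sufficient_stats(X, y))
    print("pooled coefficients:", pooled_linear_model(sites))
```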

The advantage of this model is that the same software environment can be reproduced, and users do not need to spend time installing specific versions of operating systems and applications. Thus, results are more likely to be reproducible. This model is useful when data cannot be transferred outside of an institution, as is the case in several health-care organizations in the United States, or when legislation prevents an international collaborator from transmitting data outside its country.

This model also enables the creation of a network of collaborating centers, even if institutional policies disallow sharing of data at the individual level. For example, a researcher who wants to build a prognostic model for patients with a particular disease but has limited data at her own institution may need data from several centers. She would like to use a multivariate model to adjust for potential confounders, but she is not able to access such patient-level data at different institutions. With this model, she can combine coefficients and covariance matrices that are calculated locally at each institution (using the same VM) and transmitted to a central node. Another example in which this model is beneficial is when genomic data need to stay at one institution but the phenotype data for the same patients are hosted at a different institution—and neither is able to transmit patient-level data to the other. Some algorithms can be decomposed so that multivariate models can be constructed across these “vertically” separated data. This may be one of the most effective ways to deal with international collaborations, in which legislation against physical placement of data outside of the country may currently prevent some data-sharing initiatives.
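
One standard way to combine locally estimated coefficient vectors and their covariance matrices at a central node is an inverse-variance-weighted (fixed-effect) pooling, sketched below as an illustration rather than as the exact method used in the scenarios above; the site estimates in the usage example are invented.

```python
# Minimal sketch: inverse-variance-weighted combination of coefficients and
# covariance matrices computed locally at each institution. Illustrative
# only; the numeric site estimates below are invented.
import numpy as np

def combine_site_estimates(betas, covariances):
    """Pool per-site coefficient vectors using their covariance matrices."""
    precisions = [np.linalg.inv(V) for V in covariances]
    pooled_cov = np.linalg.inv(sum(precisions))
    pooled_beta = pooled_cov @ sum(P @ b for P, b in zip(precisions, betas))
    return pooled_beta, pooled_cov

# Usage with two hypothetical sites
b1, V1 = np.array([0.8, -1.1]), np.diag([0.04, 0.09])
b2, V2 = np.array([1.0, -0.9]), np.diag([0.02, 0.05])
beta, cov = combine_site_estimates([b1, b2], [V1, V2])
print("pooled coefficients:", beta)
```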

STATE OF POLICY

Although technical solutions to data sharing are complex and varied, they may not be as challenging as the solutions for policy issues (22–25). The multiplicity of institutional policies, different types of consent, and different interpretations of what constitutes a small risk for reidentification point to the need for solutions that will largely be based on proper enforcement of well-designed policies and regulations. For example, there has been discussion on whether access to a community cloud resource could be granted depending on user “certifications” that would require training in responsible conduct of research, among other things. Simplified data-use agreements (DUAs) could be codified in addition to state and federal requirements and enforced through a network composed of several data-hosting centers. For example, iDASH investigators have worked with legal counsel at the University of California to develop a simple system to facilitate data “donation” and data “utilization” by different parties through data contributor agreements and data user agreements. This way, there are no pairwise DUAs between institutions and those who want to access the data. Observing the terms of use specified by the data contributor agreement, iDASH becomes responsible for the distribution of the data. A DUA system covers some of the requirements of data-sharing models 1 and 2 described above.

Other items that need attention include access controls appropriate to the sensitivity of the data (such as two-factor authentication). Additionally, algorithms for deidentification and data obfuscation, as well as methods to evaluate the risk of reidentification incurred in the disclosure of “limited data sets” (data sets from which certain identifiers have been removed), are also needed. With the whole genome constituting the ultimate identifier for an individual, special protections need to be implemented when these kinds of data are linked to other sensitive information.
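
As a simple illustration of the data-preparation side of these protections, the sketch below drops an abbreviated, assumed list of direct-identifier columns before disclosure. The authoritative list of identifiers to remove is defined by the applicable regulations and institutional policy, not by this code.

```python
# Minimal sketch of preparing a "limited data set" by dropping direct
# identifiers before disclosure. The field list is abbreviated and the
# column names are assumptions; regulations define the real requirements.
import pandas as pd

DIRECT_IDENTIFIERS = [
    "name", "street_address", "phone", "email",
    "ssn", "medical_record_number", "health_plan_id",
]  # abbreviated, hypothetical column names

def to_limited_data_set(df: pd.DataFrame) -> pd.DataFrame:
    """Drop whichever direct-identifier columns happen to be present."""
    present = [c for c in DIRECT_IDENTIFIERS if c in df.columns]
    return df.drop(columns=present)
```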

A SHARED FUTURE

The time has come to address the need to make better use of the avalanche of health-care and biomedical research data currently being generated through both private and public funding. All of those involved in health sciences research have a keen interest in preventing and alleviating the burden of human disease. There are technical and policy solutions to support data sharing that respect individual and institutional privacy and at the same time provide a public good that can help accelerate research. Several models of data sharing exist. We are just beginning to understand the ecosystem of sharing and to build systems that support these models. The increasing engagement of the public with translational scientists who are at the forefront of the battle against disease is changing the way we collectively look at data sharing: It is not an option, it is a necessity. Turning data into a public good in a way that respects patient privacy will affect translational research and human health in unprecedented ways.

References and Notes

Acknowledgments: I thank the iDASH team for making this work possible. Funding: iDASH is supported by NIH through the NIH Roadmap for Medical Research, grant U54HL108460. Research and development in data sharing is funded through Agency for Healthcare Research and Quality grant R01HS19913 and NIH grants UH2HL108785 and UL1TR000100 (L.O.-M.). Competing interests: I am the principal investigator of iDASH.
