There has been vigorous discussion in the scientific literature about the need for and value of sharing full data sets from biomedical and clinical research, but it's rare to see the issue get headlines in the mainstream media. In August, an article in The New York Times put the spotlight on a $60 million clinical study of Alzheimer's disease because of its innovative approach to data management: Clinical and imaging data collected in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) were made available immediately for scientists to download and analyze.
"I firmly believe that openness and transparency is in the best interests of science. And it's in the best interest of scientific careers as well." -- Andrew Vickers
The data sets have been downloaded thousands of times, 160 papers using the data have been published so far, and 80 more are in the pipeline, Michael Weiner, principal investigator of ADNI, says in an interview with Science Careers. Making data transparent and available "so that other people can analyze the data and discover different things, [is] going to accelerate all of science," he says. "It's a relatively inexpensive way to get more value out of all of the work that we do."
However, ADNI's open clinical data-sharing policy is exceptional. "There has been a culture in biomedicine of not sharing data," says Andrew Vickers, associate attending research methodologist at Memorial Sloan-Kettering Cancer Center in New York City. "I think that culture has to change. And it's going to take young investigators to change it. I firmly believe that openness and transparency is in the best interests of science. And it's in the best interest of scientific careers as well."
This week, Science, Science Careers, Science Translational Medicine, and Science Signaling have joined forces to take a broad look at the challenges and opportunities researchers face in dealing with data.
This article is one of three in Science Careers on the topic.
See the entire list of articles in all the Science publications at www.sciencemag.org/special/data/.
Some fields already have standard data-sharing practices, but not biomedicine. Guidance is particularly lacking when it comes to sharing data from clinical trials and data pooled from electronic health records. This article presents expert advice, suggestions, and resources aimed at answering key questions about sharing clinical and biomedical data.
When designing your study, you should discuss these issues with your mentor and your institutional review board, and seek out your institution's and funding agency's specific rules and regulations.
Several funding agencies have policies that support data sharing and encourage investigators to make their data available. Some journals state that sharing data from studies is required. However, these policies usually don't have a penalty for not complying, "so in some sense they're voluntary," notes Heather Piwowar, who studies data sharing as a postdoc funded by the DataONE cyberinfrastructure project.
The U.S. National Institutes of Health (NIH) makes a broad statement of support about sharing data in its grants policy statement: "NIH endorses the sharing of final research data to serve these and other important scientific goals and expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers."
Investigators who apply for NIH grants of $500,000 or more must include a data-sharing plan with their grant application. These plans should indicate how data will be shared or explain why they cannot be shared. Grant review panels don't consider the data-sharing plan when evaluating an application, but once a grant has been funded, investigators are expected to keep their data-sharing promises. "Data-sharing plans that are accepted become a term and condition of the award. The researchers can be held to their data-sharing plan," says J. P. Kim, director of the Division of Extramural Inventions and Technology Resources within the NIH Office of Extramural Research in Bethesda, Maryland.
For genetic association studies, the NIH requirements are stronger: NIH-funded investigators conducting "genome-wide analysis of genetic variation in a study population are expected to submit to the NIH genome-wide association studies (GWAS) data repository descriptive information about their studies for inclusion in an open access portion of the NIH GWAS data repository," the policy states. A frequently asked questions document about the policy says this also includes NIH-funded clinical trials that have a genetic association component. The NIH repository for GWAS data is dbGaP.
Journal policies on data sharing vary, but they typically urge authors to deposit specific types of data in their relevant repository. Here are some excerpts from Science's instructions for authors:
"Appropriate data sets (including microarray data, protein or DNA sequences, atomic coordinates or electron microscopy maps for macromolecular structures, and climate data) must be deposited in an approved database, and an accession number or a specific access address must be included in the published paper. We encourage compliance with MIBBI guidelines (Minimum Information for Biological and Biomedical Investigations). ...
Large data sets with no appropriate approved repository must be housed as supporting online material at Science, or only when this is not possible, on an archived institutional Web site, provided a copy of the data is held in escrow at Science to ensure availability to readers."
Another example from Cancer Research: "Authors of manuscripts with new nucleotide or amino acid sequences must deposit the sequence information with GenBank. ... Authors must submit the relevant accession numbers for deposited sequences with the manuscript and these will be published with the article."
Before you begin a study, check with your funding agency, institution, and target journals about their policies on sharing data -- and any possible restrictions on doing so.
A collection of links to NIH policies, guidance, and sample agreements for sharing data, biological materials, animal models, and so on is available at http://sharing.nih.gov.
Links to additional funding agencies' data-sharing policies can be found at BioSharing.
D. Field et al., " 'Omics Data Sharing." Science 326, 234 (2009).
H. A. Piwowar et al., "Towards a Data Sharing Culture: Recommendations for Leadership from Academic Health Centers." PLoS Medicine 5, e183 (2008).
H. A. Piwowar and W. W. Chapman, "Public sharing of research datasets: A pilot study of associations." Journal of Informetrics 4, 148 (2010).
Sharing data increases the transparency of the scientific process, says Weiner, who is director of the Center for Imaging of Neurodegenerative Diseases at the Veterans Affairs Medical Center in San Francisco, California. "Most data is collected by investigators. They write papers, they post papers, but the raw data and the data trail that leads to the papers is invisible." Access to raw data sets brings higher visibility to that data trail, and it allows the opportunity for scientific results to be independently tested and verified.
Weiner adds that the open-data policy in the ADNI study has meant that the data have been subjected to far more analyses than they would have been if only a small collaboration had been allowed to access them. "My colleagues and I are so busy [administrating the project] that sometimes we just don't have time to write the papers we think ought to be written, and other people are doing that," he says. "It's wonderful to see the data get analyzed."
Others cite the fact that the integrity of the research data may improve. "A robust regime of data sharing would make scientific misconduct a lot harder," says James Miller, an attorney and visiting scholar in the Department of Health Policy and Management at Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland.
Investigators who share their data have the satisfaction of contributing to those broad scientific advantages, but it can be difficult to see the advantages to them individually.
"It's probably the biggest question asked: 'What's in it for me?' " says Nicholas Anderson, assistant professor of biomedical health informatics at the University of Washington, Seattle. "What's often in it for them is collaborations, funding, and being more visible in the community by being more available."
For example, if you share or are willing to share a particular data set, a researcher who wants to study that data may invite you to collaborate on an analysis you wouldn't have pursued yourself. "I think we're seeing a lot more new investigators forming collaborations perhaps earlier than their senior peers because they have to," Anderson says. "They don't have the experience in informatics or regulatory or ethics or statistics, so they form affiliations and they really bootstrap things."
Sharing data may also increase how often your work is cited, particularly as standards for citing data take shape. Piwowar conducted a study that found that journal articles presenting cancer microarray clinical trials for which the investigators had made their data publicly available were cited about 70% more frequently than those from investigators who did not share their data. "There is evidence of citation benefit in some subdisciplines," Piwowar says. "I think that citation benefit will go up as we standardize on ways to cite data sets and as treating data sets as first-class entities becomes the norm."
H. A. Piwowar et al., "Sharing Detailed Research Data Is Associated with Increased Citation Rate." PLoS ONE 2, e308 (2007).
C. J. Savage and A. J. Vickers, "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals." PLoS ONE 4, e7078 (2009).
A. J. Vickers, "Whose data set is it anyway? Sharing raw data from randomized trials." Trials 7, 15 (2006).
Caveman. "Send me all of your reagents and ideas. We want to work on the same experiments." Journal of Cell Science 114, 1037 (2001).
You should discuss your plans for sharing your data with your mentor and your institutional review board to address informed consent, patient privacy, and IRB oversight for your study. In addition, NIH maintains a collection of links and resources for extramural researchers on its Research Involving Human Subjects Web page. NIH addresses patient-protection issues in the online booklet Protecting Personal Health Information in Research. Below is some general information on the topic.
The Health Insurance Portability and Accountability Act (HIPAA) is designed to protect patients' personal health information. A patient may give informed consent for his or her health information to be used in a particular clinical study, but that applies only to the research question outlined in the informed consent document. That patient's clinical data cannot be used in the context of another study if the patient's identifying information (such as name, hospital record, or date of birth) remains linked to the patient's clinical data.
This puts limits on sharing data that contain protected health information. "Even in cases where HIPAA would not prevent the sharing of data, finding out whether or not it does is time consuming and complicated, so there's a tendency for some researchers to say, 'If there's any possibility that I could run afoul of HIPAA, I'm simply not going to share my data,' " Miller says.
However, once identifying information has been removed from data, the data are no longer subject to the rules of the privacy act, nor are they restricted by the terms of the original informed consent. "According to federal rule, de-identified data is not subject to IRB overview," Vickers says. "IRBs are there to protect patients, and this is not a patient-protection issue. I do advise people to speak to their IRBs just to confirm."
There are two ways to de-identify data, according to the parameters of HIPAA. First, you can remove 18 specific identifiers from the data record, which include things such as name; a geographic location smaller than a state; all dates related to the individual such as birth date, admission date, or date of death; social security number; and medical record number.
Or, if it's not possible to remove all identifiers, researchers can use statistical methods to mask the identifiers. "If for some reason you need date of birth, you can add jitter to it ... so it's still statistically valid," Vickers says. "If there's a date that's critical, maybe the date of surgery, you add a little bit of random noise to it." In that case, a qualified statistician must review the data and certify that the risk of identifying individual patients in it is very small.
"To de-identify 99% of data sets takes 5 minutes," Vickers says.
Vickers and colleagues provide further guidance on de-identification of data in their guidelines for preparing raw clinical data for publication.
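As a rough illustration of the steps described above, the following Python sketch drops a handful of identifier fields from each record and adds random noise ("jitter") to one date that the analysis needs. It is a minimal sketch under stated assumptions, not a HIPAA-compliant tool: the field names, the plus-or-minus 7-day window, and the function names are all hypothetical, only a few of the 18 identifier types are listed, and a data set that keeps a jittered date would still need review by a qualified statistician as described above.

```python
import random
from datetime import date, timedelta

# Hypothetical field names standing in for a few of the 18 HIPAA identifiers.
# A real data set needs every identifier type checked, not just these.
IDENTIFIER_FIELDS = {
    "name",
    "social_security_number",
    "medical_record_number",
    "birth_date",
    "zip_code",
}


def remove_identifiers(record, identifier_fields=IDENTIFIER_FIELDS):
    """Return a copy of the record without the listed identifier fields."""
    return {k: v for k, v in record.items() if k not in identifier_fields}


def jitter_date(original, max_days, rng):
    """Shift a date by a random whole number of days in [-max_days, max_days]."""
    return original + timedelta(days=rng.randint(-max_days, max_days))


def deidentify(records, date_field, max_days=7, seed=None):
    """Strip identifier fields and jitter one analysis-critical date per record.

    The input records are not modified; new dictionaries are returned.
    """
    rng = random.Random(seed)
    cleaned = []
    for record in records:
        clean = remove_identifiers(record)
        if date_field in clean:
            clean[date_field] = jitter_date(clean[date_field], max_days, rng)
        cleaned.append(clean)
    return cleaned


if __name__ == "__main__":
    patients = [
        {
            "name": "Example Patient",
            "medical_record_number": "000000",
            "surgery_date": date(2010, 1, 15),
            "outcome": 1,
        }
    ]
    print(deidentify(patients, "surgery_date", max_days=7))
```

Note that this sketch jitters each record independently. If a patient has several dates whose spacing matters to the analysis, shifting all of that patient's dates by the same random offset may be a better choice, since it preserves the intervals between them.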
Understanding Health Information Privacy. U.S. Department of Health and Human Services.
Protecting Personal Health Information in Research: Understanding the HIPAA Privacy Rule. U.S. National Institutes of Health.
Research Involving Human Subjects. Office of Extramural Research, U.S. National Institutes of Health.
I. Hrynaszkiewicz and D. G. Altman, "Towards agreement on best practice for publishing raw clinical trial data." Trials 10, 17 (2009).
I. Hrynaszkiewicz et al., "Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers." Trials 11, 9 (2010).
B. Malin et al., "Technical and Policy Approaches to Balancing Patient Privacy and Data Sharing in Clinical and Translational Research." J Investig Med 58, 11 (2010).
J. D. Miller, "Sharing clinical research data in the United States under the health insurance portability and accountability act and the privacy rule." Trials 11, 112 (2010).
You should think about how you will manage and ultimately share your data from the earliest stages of designing your study. "Prospective design is critical," Anderson says. "I don't know how I can stress enough that [you need] early understanding of the knowledge structure and management of a clinical trial or a research experiment that has some alignment with both existing data, ownership of the data, and expectations for both analysis and sharing of it."
The NIH Web site sharing.nih.gov includes a sample data-sharing plan and key elements to consider for data sharing, both of which contain useful points to consider even if you're not applying for NIH funding. Some sample questions from the key elements document:
-What types of data are to be collected in the study and shared (such as genetic, physiological, or clinical)?
-What data documentation will be shared so that others can understand and use the dataset without misuse, misinterpretation, or confusion?
-Will a new repository need to be developed, and if so, who will maintain the repository?
-Will the data be distributed directly by an investigator to those who request it (e.g., through an electronic file)?
-What steps will be taken to help researchers know that the data sets exist?
These questions give some indication of the sorts of issues you should be grappling with when you start to design your study. The U.K.-based Wellcome Trust maintains similar documents for its grant applicants, a guidance on preparing data-sharing plans and a Q&A on data sharing, which may provide additional points to ponder when developing your own data-sharing plan.
Consult your own institution and funding agency about their specific data-sharing requirements. Or consult the BioSharing Web site, which maintains a list of several funding agencies' data-sharing policies.
As you think about how you will collect and manage your data, consider what reporting standards apply to your type of study and your specific field. Many subfields don't yet have such standards; this has long been a problem in clinical and biomedical research, and researchers in many subdisciplines are working to develop such standards.
Some organizations are developing global data standards, such as those from the Clinical Data Interchange Standards Consortium. Data annotation standards have also been developed for fields such as autism research, neuroscience, and cancer research. There are reporting standards for specific scientific techniques, such as the MIAME guidelines for microarray data. (A list of more standards is available from the BioSharing Web site.) Ensuring that your data conform to established standards will help ensure the utility of your data set to other researchers.
Vickers and colleagues have published guidelines for preparing raw clinical data for publication, which offer suggestions for nearly every step in the path, from data collection to publication, with an eye toward sharing the data with other researchers. "We realized that science is very, very heterogeneous and it's impossible to sit in a room and predict all the sorts of data types you could have," Vickers says. "What we said is that people should provide data and code and that the data and code should be sufficiently well annotated that a competent statistician could replicate the main results in the paper."
Researchers recognize that it's almost impossible to standardize certain types of data. But even if that's true of your data, you should make sure your data are available in a format that's useful to other investigators. "If a researcher makes the patient-level data available in a PDF format, those data are basically worthless," Miller says. "You have to make data available in a data set that people can download into their statistical package of choice."
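One simple way to follow this advice is to release patient-level data as a plain comma-separated file, which most statistical packages can import, together with a second file describing each variable. The Python sketch below shows one possible layout; the file names, variable names, and codebook columns are hypothetical choices made for illustration and are not drawn from any of the guidelines cited here.

```python
import csv


def write_dataset(rows, fieldnames, data_path):
    """Write de-identified patient-level rows to a CSV file with a header row."""
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def write_codebook(codebook, codebook_path):
    """Write a data dictionary: one row per variable with description and coding."""
    with open(codebook_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "description", "units_or_coding"])
        for variable, (description, coding) in codebook.items():
            writer.writerow([variable, description, coding])


if __name__ == "__main__":
    # Hypothetical example data; real variables depend on the study.
    rows = [
        {"patient_id": 1, "age_years": 63, "treatment_arm": "A", "responded": 1},
        {"patient_id": 2, "age_years": 58, "treatment_arm": "B", "responded": 0},
    ]
    codebook = {
        "patient_id": ("Study-assigned identifier", "integer"),
        "age_years": ("Age at enrollment", "years"),
        "treatment_arm": ("Randomized treatment group", "A or B"),
        "responded": ("Response to treatment", "1 = yes, 0 = no"),
    }
    write_dataset(rows, list(codebook), "trial_data.csv")
    write_codebook(codebook, "trial_codebook.csv")
```

Keeping the codebook in a separate file next to the data means another investigator can load the data directly and still see what each column means and how it is coded.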
Finally, you should consider how and where to post your data. Repositories exist for certain kinds of data, such as Proteome Commons for proteomics data and dbGaP for GWAS data. But there is no single repository for clinical and biomedical data -- which is as it should be, several experts interviewed for this article say. Dryad is a repository for data sets for peer-reviewed, published articles in basic and applied biosciences, and Sage Commons is for integrative genomics and disease modeling. Several interviewees noted the Dataverse Network Project, which can serve as a mechanism for managing data and sharing it, either by uploading data to the IQSS Dataverse Network or by downloading the Dataverse software and creating your own repository.
Small data sets can be published as supplements with the corresponding journal article. But many data sets are too large to post as supplementary data, and others still contain sensitive information about patients and so cannot be posted publicly. In addition, data supplements may not be a durable solution for sharing data: In a 2006 study, Anderson and colleagues looked at online data supplements accompanying a subset of articles indexed in PubMed and found that 17% to 29% were no longer available -- some as soon as 1 year after publication.
A lot of common data-sharing methods, such as putting data on a university department's Web server, have proven to be unsustainable. "I've seen so many grants that say, 'We're going to make it available on the faculty Web server,' " Anderson says. "You know that that probably won't remain true for long -- not for any nefarious reason, just because someone has to do it, that person may quit, or something might change."
That's why Kim and others recommend repositories for sharing data. "The best way to share is to put it in an appropriate repository because that way the data is automatically taken care of," says Kim. "The data would only be shared appropriately. It also alleviates the burden on the PI from having to fulfill data requests continuously."
Researchers need a broad understanding of informatics to deal with their study data. "I would recommend that [investigators] familiarize themselves with how information is beginning to be shared and structured, from discovery systems to outcome-data capture to common surveys to HIPAA, and what the constraints are, as early as possible, such that they can be more strategic about it," Anderson says.
But when you need specialized knowledge, you should reach out to experts and collaborate with them. "It's hard to do any of these trials on your own," Anderson says. "You're being reviewed by interdisciplinary teams, and you're competing with interdisciplinary teams. So you have to form interdisciplinary teams."
Guidance and policy for NIH grantees are available at http://sharing.nih.gov.
Key Elements to Consider in Preparing a Data Sharing Plan Under NIH Extramural Support, Office Of Extramural Research, National Institutes of Health. Accessed 8 February 2011.
Example Plan addressing Key Elements for a Data Sharing Plan under NIH Extramural Support, Office of Extramural Research, National Institutes of Health. Accessed 8 February 2011.
N. R. Anderson et al., "Issues in biomedical research data management and analysis: needs and barriers." J Am Med Inform Assoc. 14, 478 (2007).
N. R. Anderson et al., "On the persistence of supplementary resources in biomedical publications." BMC Bioinformatics 7, 260 (2006).
S. M. Fullerton et al., "Meeting the Governance Challenges of Next-Generation Biorepository Research." Sci Transl Med 2, 15cm3 (2010).
I. Hrynaszkiewicz et al., "Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers." Trials 11, 9 (2010).
Blogs and Web sites:
-Harvard University professor Gary King maintains a list of data sharing articles, publications, and policies of interest.
-Protecting Data Privacy in Health Services Research, Committee on the Role of Institutional Review Boards in Health Services Research Data Privacy Protection, Division of Health Care Services, Institute of Medicine, 2000.
-Improving Access to and Confidentiality of Research Data: Report of a Workshop, Committee on National Statistics, Commission on Behavioral and Social Sciences and Education, National Research Council, 2000.
-Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences, Board on Life Sciences, Division on Earth and Life Studies, National Research Council, 2003.
-Several articles referenced above are part of a series of articles in the journal Trials on sharing clinical data. Articles in the series are collected online at http://www.trialsjournal.com/series/sharing.
-A set of articles in Science Careers in May 2002 called "Sharing in the Sciences" addressed the topic of data sharing. Among them was the article "The Selfish Gene: Data Sharing and Withholding in Academic Genetics," by Eric Campbell and David Blumenthal, co-authors of the oft-cited paper "Withholding Research Results in Academic Life Science." JAMA 277, 1224 (1997).
-F. LeClere, "Too Many Researchers Are Reluctant to Share Their Data." Chronicle of Higher Education, 3 August 2010. Accessed 6 February 2011.
-J. Kaiser, "Making Clinical Data Widely Available." Science 322, 217 (2008).
-A. Vickers, "Cancer Data? Sorry, Can’t Have It." The New York Times, 22 January 2008.
Kate Travis is the editor of CTSciNet, the Clinical and Translational Science Network, an online portal for career development in clinical and translational research produced by Science Careers.