Technology Feature

Translating big data: The proteomics challenge

This special feature is brought to you by the Science/AAAS Custom Publishing Office

Getting the most out of protein-related information depends on teamwork among scientists around the world, and that involves sharing large datasets. Simply passing big data back and forth, however, is not the main problem; the real obstacle is sharing that data in a form that other scientists can use. Building software that can interpret information from different experiments and equipment remains complicated; likewise, exploring and analyzing large proteomics datasets, even within a single lab, requires software that is most often developed in-house.

The roughly 20,000 protein-encoding genes in humans should make at least 20,000 proteins. However, modifications create more proteins—maybe many more. As of April 4, 2018, the Human Proteome Map included 30,057 proteins. Combining so many molecules with analytical technology such as mass spectrometry (MS), which explores many fine details, creates “big data.” The masses of complex information being uncovered about proteins are often so large that they require teams of scientists just to work on one dataset. 
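
Much of that growth comes from combinatorics: each modification site on a protein can be occupied or not, so even a handful of sites multiplies the number of distinct protein forms. The back-of-envelope sketch below, in Python, uses purely illustrative numbers of modification sites rather than measured values to show how quickly the count climbs.

```python
# Back-of-envelope sketch: how modifications inflate the protein count.
# The numbers of modification sites below are purely illustrative, not
# measurements for any real protein.

GENES = 20_000  # approximate number of human protein-coding genes

def proteoforms_per_protein(n_modification_sites: int) -> int:
    """Each site can be either modified or not, so n independent sites
    yield 2**n distinct forms of a single base protein."""
    return 2 ** n_modification_sites

for sites in (0, 1, 3, 5):
    per_protein = proteoforms_per_protein(sites)
    print(f"{sites} sites per protein -> {per_protein:>2} forms each, "
          f"{GENES * per_protein:,} across ~{GENES:,} genes")
```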

Yet despite the size and complexity of these datasets, sharing them is becoming expected. According to Joshua Coon, director of the NIH National Center for Quantitative Biology of Complex Systems at the University of Wisconsin–Madison, raw data from proteomics studies is usually available today in a database, or the authors will send it upon request. 

“That was not the case 10 years ago, but attitudes have changed,” he says. The proteomics community—and increasingly the scientific community in general—realizes that data transparency improves the level of trust among researchers, even with those in different fields. 



Struggles of sharing

It’s easier than ever to produce large amounts of protein-related data—but it’s not always easy to share that data in the most helpful way. “In a couple of days, a protein scientist can create a terabyte of data, which is hard to transfer or visualize,” says Gary Kruppa, vice president of business development for proteomics at Bruker Daltonics in Billerica, Massachusetts. “A month of data gets hard to even store.”
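
Kruppa’s figures translate into storage demands quickly. The rough calculation below assumes the rate of roughly one terabyte every two days that he describes and treats the retention period as an illustrative choice, just to give a sense of the scale a single lab faces.

```python
# Rough storage estimate based on the rate Kruppa describes; the exact
# acquisition rate and retention period are assumptions for illustration.

TB_PER_DAY = 1 / 2          # ~1 terabyte every couple of days
DAYS_PER_MONTH = 30

monthly_tb = TB_PER_DAY * DAYS_PER_MONTH
yearly_tb = TB_PER_DAY * 365

print(f"~{monthly_tb:.0f} TB per month, ~{yearly_tb:.0f} TB per year per instrument")
```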

The difficulty of sharing such copious amounts of data stems from the sheer number of possible approaches to doing so, and from the need to provide sufficient experimental and biological metadata. If a scientist just wants to share raw data from a proteomics experiment, along with a little background on what it represents and some results, “it’s very straightforward,” says Juan Antonio Vizcaíno, proteomics team leader at the European Bioinformatics Institute (EMBL-EBI) in Cambridge, United Kingdom.

The challenges mount as more information is shared among more scientists. Just dumping information into a database, for example, isn’t enough. “Someone has to pay attention to ensure that the data is of sufficient quality [for other scientists] to be able to do something with it,” says Andreas Huhmer, global marketing director for proteomics solutions at Thermo Fisher Scientific in San Jose, California. Plus, data cannot be easily uploaded to a database and retrieved unless it’s in a standardized format.
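
Standardized formats are what make that upload-and-retrieve cycle possible. One widely used community standard for raw MS data is mzML, maintained by the Proteomics Standards Initiative. The minimal sketch below, which assumes the open-source pyteomics library and a placeholder file named sample.mzML, illustrates how a standard format lets any lab’s software read another lab’s spectra.

```python
# Minimal sketch: reading spectra from an mzML file with pyteomics.
# Assumes `pip install pyteomics lxml` and a local file named "sample.mzML";
# the file name is a placeholder, not a dataset from the article.
from pyteomics import mzml

with mzml.read("sample.mzML") as spectra:
    for i, spectrum in enumerate(spectra):
        mz = spectrum["m/z array"]
        intensity = spectrum["intensity array"]
        print(f"spectrum {i}: {len(mz)} peaks, "
              f"base peak intensity {intensity.max():.1f}")
        if i >= 4:  # just peek at the first few spectra
            break
```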

The method of analyzing this data also impacts the conclusions drawn from it. “There are countless ways to analyze proteomics data these days, leading to subjective interpretations of data,” explains Andrew Webb, acting head for the division of systems biology and personalized medicine at the Walter and Eliza Hall Institute of Medical Research in Parkville, Australia. 

Other experts agree that analyzing data still challenges proteomics scientists. “How we effectively and efficiently turn raw data into something meaningful—even in one lab—is the first big hurdle,” notes James Langridge, director of health sciences at Waters in Manchester, United Kingdom. 

Even when scientists agree on standardized formats for the data and ways to analyze it, more work lies ahead. For one thing, the data standards must be updated as needed. In addition, sharing even the biggest proteomics datasets will fall short on its own. “To maximize the scientific knowledge that can be derived from proteomics datasets, that knowledge should be systematically integrated with its genomic counterparts—the genome and transcriptome,” says Henry Rodriguez, director of the Office of Cancer Clinical Proteomics Research at the U.S. National Cancer Institute (NCI) in Rockville, Maryland. “By integrating proteomics with genomics—proteogenomics, a multi-'omics approach—the amount of new biological knowledge that can be derived will be greater than the sum of each individual 'omics part.”
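
In practice, the integration Rodriguez describes often starts with something mundane: joining protein-level and transcript-level measurements on a shared gene identifier. The sketch below illustrates that idea with pandas and hypothetical abundance tables; it is not a description of any consortium’s actual pipeline.

```python
# Illustrative sketch of proteogenomic integration: join protein and transcript
# abundances on a shared gene identifier. Tables and values are hypothetical.
import pandas as pd

proteome = pd.DataFrame({
    "gene": ["TP53", "EGFR", "MYC"],
    "protein_abundance": [12.4, 8.1, 15.0],
})
transcriptome = pd.DataFrame({
    "gene": ["TP53", "EGFR", "KRAS"],
    "transcript_tpm": [30.2, 55.7, 12.9],
})

# An outer join keeps genes seen in only one of the two datasets,
# which is itself biologically interesting (e.g., translational regulation).
merged = proteome.merge(transcriptome, on="gene", how="outer")
print(merged)
```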

Science of scale

The value of sharing big datasets from proteomics arises from the results they can potentially deliver, such as improvements in health care. For example, says Rodriguez, “Pharma could benefit by better understanding disease and, therefore, developing more effective drugs.”

Likewise, proteomics tools can be combined with other tools, such as gene-editing technologies like CRISPR. “The ability to edit a biological system and look at the phenotype is really quite amazing,” says Langridge. Tweaking the system with gene-editing tools and then analyzing the results will help scientists unravel the functions of specific proteins. 

Some of today’s biggest opportunities for sharing come from databases developed for that very purpose. One is EMBL-EBI’s PRoteomics IDEntifications (PRIDE) database. It includes proteomics data from more than 50 countries and over 8,400 datasets, representing nearly 80,000 assays for acquiring proteomics data—all adding up to about 400 terabytes.
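
Like most large repositories, PRIDE can also be queried programmatically. The sketch below uses Python’s requests library against what is believed to be the PRIDE Archive REST endpoint; the URL, the example accession, and the response fields should all be treated as assumptions and checked against the current API documentation.

```python
# Hedged sketch: fetching metadata for one PRIDE dataset over HTTP.
# The endpoint URL, the accession, and the response fields are assumptions
# based on the public PRIDE Archive API and may differ from the live service.
import requests

ACCESSION = "PXD000001"  # example-style ProteomeXchange accession
url = f"https://www.ebi.ac.uk/pride/ws/archive/v2/projects/{ACCESSION}"

response = requests.get(url, timeout=30)
response.raise_for_status()
project = response.json()

# Print whichever common metadata fields happen to be present.
for field in ("accession", "title", "submissionDate"):
    print(field, ":", project.get(field))
```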

The Swiss Institute of Bioinformatics in Lausanne developed neXtProt, a protein knowledge-base focused on human proteins. It includes entries on more than 20,000 proteins and nearly 200,000 posttranslational modifications.

“The most famous protein knowledge-base is UniProt, which is focused on more than just human [proteins],” says Vizcaíno.

Databases like these enable new kinds of science. “You can try to come up with ways to combine datasets produced by different labs, or look for more innovative ways to analyze the data,” Vizcaíno says. “Usually, the analysis of proteomics data is answering one set of questions, but there could be other ways that the data can be analyzed.” So, if someone comes up with a new way to explore existing data, the results could unveil new biological knowledge.

Far more about proteins remains to be determined. As Huhmer points out, “There are about 15,000 known families of proteins.” Proteins within a family all have some structural similarity. According to Huhmer, scientists have investigated the structures of these families and measured about 4,500 of them directly with technology such as X-ray crystallography; they have determined another 4,500 structures through computer modeling (with high confidence in only about 1,000 of them); and they have no idea about the structure of the remaining 6,000 or so families.

Advances in technology keep giving scientists more proteomics data to handle. For example, Huhmer mentions that multiplexing label-free approaches to MS could be used to generate 1 million data points a day. In addition, combining MS with structural techniques—such as cryogenic electron microscopy, which can determine a protein’s three-dimensional shape—could be used to “analyze some structures that are uncharacterized today,” he says. “So, the evolution of technology is revealing more information about protein structure and driving more research in that space.”

Even better, once a method reveals the structure of one member of a protein family, a computational approach can be used to unravel other protein structures in that family. “Then, the results grow exponentially,” Huhmer explains. In fact, computation plays a wide role in advancing proteomics data and how scientists share it.

The right combinations of technology and research groups make it even easier for scientists to share proteomics data and to collaborate on projects. For example, the Technical University of Munich (TUM), JPT Peptide Technologies (JPT) in Berlin, SAP in Walldorf, Germany, and Thermo Fisher Scientific created a consortium to help scientists translate proteomics data into advances in basic and medical research. The research data generated by this consortium will be freely available in an online database called ProteomeTools.

Computing the connections

The example of the ProteomeTools consortium makes it clear that scientists and organizations need to create new ways to work together and share large proteomics datasets. Of course, with so much data being generated and so many possible connections between experiments and results, scientists are more focused than ever on developing new computational tools.

“I think artificial intelligence, machine learning, and deep learning are exciting areas in technology that encourage scientists to share big data,” Rodriguez notes. “The reason is that [these technologies] require lots of data and, thus, nudge the research community to share big data in order to ensure their continued development.”

Although Rodriguez appreciates the potential of these tools for making the connections that can generate new hypotheses to investigate, he adds, “We must never forget that it’s not only about the technology. It’s how people interpret, scrutinize, challenge, and agree or disagree with the analysis.”

As Rodriguez explains, “These computational systems beckon the need for greater collaboration and for open data science that creates value in new ways.” As examples of such ongoing and extensive collaborations, he points to three: the NCI’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) program; the Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO) network, which is a collaboration between NCI, the Department of Defense, and the Department of Veterans Affairs; and the International Cancer Proteogenome Consortium (ICPC), which “encourages data to be made available to the public through its ‘Data-Sharing Pledge,’” notes Rodriguez.

To further the power of such collaborations, analytical platforms must include techniques that simplify data collection and sharing. Along these lines, Waters developed its SONAR, a data-independent method of acquiring tandem MS data. “The whole idea is that we can acquire proteomics data in a consistent fashion,” Langridge says. For every sample tested, this system acquires a quantitative measure of the peptides and proteins. As Langridge explains, “Instead of just identifying the protein, it gathers its abundance across different runs.” Moreover, the user makes no decisions before the run starts about the data to collect, because SONAR collects it all. “The challenge with targeted assays,” Langridge says, “is that you’re making decisions upfront about what you’re going to focus on, but you often don’t know if you’ll have off-target effects or another biochemical pathway involved.”
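
The payoff Langridge describes, a quantitative readout for every peptide in every run, lends itself to straightforward cross-run comparison once the measurements sit in a table. The sketch below assumes a hypothetical long-format table of peptide intensities; it is not based on SONAR’s actual output files.

```python
# Illustrative cross-run comparison of peptide abundance; the table layout and
# values are hypothetical and do not reflect SONAR's actual output.
import pandas as pd

long_table = pd.DataFrame({
    "peptide":   ["AAGK", "AAGK", "AAGK", "LLSR", "LLSR", "LLSR"],
    "run":       ["run1", "run2", "run3", "run1", "run2", "run3"],
    "intensity": [1.0e6, 1.2e6, 0.9e6, 4.0e5, 3.6e5, 4.4e5],
})

# Pivot to peptides x runs, so every peptide has an abundance in every run;
# that is what makes run-to-run comparisons straightforward.
matrix = long_table.pivot(index="peptide", columns="run", values="intensity")
print(matrix)
print(matrix.mean(axis=1).rename("mean_intensity"))
```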

Gathering so much data and storing it in ways that scientists can share it and revisit it in the future should enhance the ongoing value of a dataset. “A lot of studies get published and never looked at again,” Kruppa notes, “and that data can’t be validated if it can’t be shared easily.” So, creating data-sharing tools allows both new and old results to be validated. “Plus, [these tools let you] analyze data from other scientists and do a more effective metacomparison of your study to others,” he explains.

Older datasets can also help scientists move ahead in developing tools. For example, a new analytical tool can be tested on existing datasets and adjusted if needed. “Lots of scientists are working on new tools for analytical techniques involving artificial intelligence,” Kruppa points out, “and these tools can be validated on old datasets as long as they can be easily shared.”

Whether data is easily shared or not depends on its format. So, Bruker developed its trapped ion mobility spectrometry time-of-flight mass spectrometry (timsTOF Pro MS/MS) platform to create a format that is available to anyone. “This instrument will generate lots of data, and we need to make it easy to work with,” Kruppa says, adding that without this kind of data compatibility, even the most advanced computational tools will hit roadblocks when attempting to compare datasets.

Seeing what is shared

At this point, one thing is clear: Proteomics scientists do not lack data. Instead, most of these scientists would probably agree with Coon, who says, “We are drowning in data.”

Coon notes that the best results require collecting all the raw MS data from an experiment and processing it as a batch. “You want the first and last samples to be collected and analyzed in the same way,” he says. 

Getting that done, especially the analytical part, often requires scientists to make their own tools. For instance, Coon hired a data scientist for two years to build a visualization tool. With datasets that combine proteomics, lipidomics, and metabolomics data, his research team needed a way to analyze and organize projects. So, Coon and his colleagues integrated their data viewer into a website.

“We first did this with a yeast project to let people use the data,” Coon explains, “and now we create a site like that for every project.” So, instead of giving visitors an 8,000-column Excel spreadsheet, Coon’s viewer lets other scientists easily compare different samples. “They can make queries on the data very fast,” he adds.
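
The viewer itself is not described in detail here, but the general pattern is familiar: move a huge results table out of a spreadsheet and into a queryable store so comparisons run fast. The sketch below, which uses pandas with SQLite and a hypothetical wide table, illustrates that pattern rather than Coon’s actual tool.

```python
# Illustrative sketch (not Coon's actual tool): load a wide results table into
# SQLite so collaborators can run fast queries instead of scrolling a spreadsheet.
import sqlite3
import pandas as pd

# Hypothetical wide table: one row per protein, one column per sample.
results = pd.DataFrame({
    "protein":  ["P1", "P2", "P3"],
    "sample_a": [10.2, 3.3, 7.8],
    "sample_b": [11.0, 2.9, 8.4],
})

conn = sqlite3.connect("project_results.db")
results.to_sql("abundance", conn, if_exists="replace", index=False)

# A collaborator can now compare samples with a single query.
query = "SELECT protein, sample_a, sample_b FROM abundance WHERE sample_a > 5"
print(pd.read_sql_query(query, conn))
conn.close()
```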

Although Coon says he hasn’t seen many other groups take a similar approach, he has found that it helps his team and other scientists extract useful biological information from a dataset, because they can interrogate it and compare samples and data points very quickly.

“Most labs that generate so much data have to figure out how to get from raw MS files to something useful, and they probably have their own tools,” Coon notes. “There hasn’t been a lot of industry leadership that everyone can use.” He adds, “People don’t value software as much as hardware.”

To move forward in sharing large proteomics datasets, however, both hardware and software must continue to improve. Plus, scientists must maintain data quality. While “size is usually what leaps out at the mention of big data,” Rodriguez says, “the content and quality of information that gets pulled out of big data is what I consider to be ‘big’ in terms of knowledge opportunities.”
