Group Seeks Global Protocol to Identify Big Data Sets

SINGAPORE—Efforts are about to get under way to create an identification system that would make data sets easier for researchers to locate and use. Meeting here this week, scientists discussed the benefits of tapping into the reams of information being collected daily on everything from global weather to patient vital statistics.

"It's like having books in a library without a Dewey decimal system," says IBM vice president Bernard Meyerson about the current situation, in which there is no systematic way for scientists to find out what data sets exist and how to access the information.

Meyerson is a member of the Board on Global Science and Technology of the National Research Council, an arm of the U.S. National Academies. The board was set up in 2009 to examine the implications of global scientific and technological advances for U.S. policy and to foster international collaborations. The group "chose big data as a place to start" because of widespread interest in the issue, says Ruth David, chair of the panel and president of Analytic Services Inc., a not-for-profit national security think tank based in Arlington, Virginia.

The 4-day symposium, "Realizing the Value from Big Data," was jointly sponsored by the board and the Institute for Infocomm Research of Singapore's Agency for Science, Technology and Research, and it drew scientists from around the world. One outcome is a group assigned to work out the details of an identification system. Meyerson said the group is considering a short digital tag that would uniquely identify a data set and provide basic details on the information it contains, as well as the conditions of access and use. The key will be making the tag easy to use for those generating and using data sets, he said.
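No format for the tag has been announced, so the following is only an illustrative sketch of what such a record might carry, based on the three elements Meyerson described: a unique identifier, basic details, and conditions of access and use. The class name, field names, the use of a random UUID, and the JSON encoding are all assumptions made for illustration, not part of any proposal from the group.

```python
import json
import uuid
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataSetTag:
    """Hypothetical data set tag; every field name here is an assumption."""
    identifier: str   # unique ID for the data set
    description: str  # basic details on the information it contains
    access: str       # conditions of access
    use: str          # conditions of use

    def to_json(self) -> str:
        # Compact JSON with sorted keys keeps the tag short and stable.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))


def make_tag(description: str, access: str, use: str) -> DataSetTag:
    # A random UUID is one simple way to get a practically unique identifier;
    # a real protocol might choose a different scheme.
    return DataSetTag(str(uuid.uuid4()), description, access, use)


tag = make_tag("Hourly global surface temperature readings", "open", "cite the source")
print(tag.to_json())
```

A flat record like this would be easy for data producers to fill in, which is the usability concern Meyerson raised, but it says nothing about how tags would be registered or looked up.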

Within a year, the group hopes to come up with a protocol that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force, which develops and promotes Internet standards. Other topics for future discussion include rating the reliability of data sets and developing ways to merge data recorded in different formats.