It took 20 months longer than planned, and a daunting statistical challenge remains. But Facebook is finally giving researchers access to a trove of data on how its users have shared information—and misinformation—on recent political events around the world.
The data being made available today consist of 38 million URLs relating to civic discourse that were shared publicly on Facebook between January 2017 and July 2019. They reveal such details as whether users considered a linked site to be fake news or hate speech, and whether a link was clicked on or liked. Facebook is also providing demographic information—age, gender, and location—about the people who shared, clicked on, or liked those links, as well as their political affinities.
In April 2018, Facebook announced that social scientists would soon have access to this shared-link data. But then its own data experts realized that making the data available could compromise the privacy of a significant portion of its 2 billion users.
To solve the problem, the company decided to apply differential privacy (DP), a recently developed, mathematics-based method for ensuring the anonymity of its users, before releasing the “shared links” data set. That work has now been done, and social scientists are hailing the results.
“It’s a huge step forward,” says Joshua Tucker, a professor of politics and Russian studies at New York University who is hoping to use the data to augment his studies on how politically charged news spreads across social media platforms. “This is much closer to what was promised in the [April 2018] announcement. It will allow us to do a lot of the research we had proposed, and some things that weren’t even in [that proposal].”
But the solution also presents social scientists with the challenge of coping with the distortions, or noise, that have been injected into the data through the use of differential privacy. Data managers have always tried to ensure privacy, but DP will require new approaches. In particular, the injected noise distorts results more as individual cells become smaller.
But those smaller cells may also contain some important results. “So, we will need to come up with methods that convince us that the data are useful in answering the questions we have raised,” Tucker says.
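To see why small cells are the hardest hit, consider a minimal sketch of the textbook Laplace mechanism applied to cell counts. It is an illustration under stated assumptions, not a description of Facebook's pipeline: the article does not say which mechanism or privacy budget the company used, so the mechanism, the `epsilon` value, the cell sizes, and the function names below are all hypothetical. In this setup the noise has the same scale for every cell, so its relative effect shrinks as the true count grows.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, rng):
    # Assumption: each person contributes to at most one cell, so the
    # sensitivity of every count is 1 and the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def mean_relative_error(true_count, epsilon, trials, rng):
    # Average of |noisy - true| / true over repeated noisy releases.
    total = 0.0
    for _ in range(trials):
        noisy = privatize_counts([true_count], epsilon, rng)[0]
        total += abs(noisy - true_count) / true_count
    return total / trials


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for cell in (10, 1_000, 100_000):  # hypothetical cell sizes
    err = mean_relative_error(cell, epsilon=0.5, trials=5_000, rng=rng)
    print(f"true count {cell:>7}: mean relative error {err:.4%}")
```

With the same noise scale applied everywhere, a cell of 10 people is distorted far more, proportionally, than a cell of 100,000, which is the trade-off Tucker describes.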
Hurry up and wait
Stung by evidence that it had given political operatives unauthorized use of its data, Facebook officials announced in April 2018 that the company would grant researchers access to information about its users. That information had long been considered proprietary, and any publicly available research done on it was either conducted in-house or required preapproval from Facebook.
Gary King, a quantitative social scientist at Harvard University, and Nathaniel Persily, a law professor at Stanford University, quickly formed a nonprofit entity, Social Science One, that would host the data on its website and vet requests to access it. Several major charitable organizations chipped in $11 million to fund proposals from scientists who wanted to use the data, and the Social Science Research Council (SSRC), a nonprofit organization, agreed to manage the grantmaking process.
SSRC put out a call for proposals, and Tucker received one of a dozen grants awarded in that first round, for $50,000. Tucker, who is also an adviser to Social Science One, had recently found that Facebook users older than 65 were nearly seven times as likely to share misinformation in the runup to the 2016 U.S. elections as those in their 20s.
That project relied on traditional surveys of people who had agreed to share their online behavior. Tucker wanted to go further, linking publicly available data he had obtained from Reddit and Twitter to the nonpublic user data held by Facebook. But the data weren’t available.
“When Facebook originally agreed to make data available to academics through a structure we developed … and [CEO] Mark Zuckerberg testified about our idea before Congress, we thought this day would take about two months of work. It has taken twenty,” King and Persily write in a blog post today.
The two scholars believe there were good reasons for the delay. “Most of the last 20 months has involved negotiating with Facebook over their increasingly conservative views of privacy and the law,” they write, “[A]nd watching Facebook build an information security and data privacy infrastructure adequate to share data with academics.”
Facebook has spent $11 million and assigned more than 20 full-time staffers to the project, writes Chaya Nayak, who leads the company’s election research commission that is working with Social Science One. Nayak also does a bit of crowing: “This release delivers on the commitment we made in July 2018 to share a data set that enables researchers to study information and misinformation on Facebook, while also ensuring that we protect the privacy of our users.”
The next step is up to researchers. The challenge is to figure out how to adapt traditional methods of analyzing large data sets, such as multiple regression, to data protected by differential privacy.
“Censoring [certain values] and noise are the same as selection bias and measurement error bias—both serious statistical issues,” King and Persily write. “It makes no sense … to provide data to researchers, only to have researchers (and society at large) being misled and drawing the wrong conclusions about the effects of social media on elections and democracy.”
This month, King and graduate student Georgina Evans described how to carry out linear regression on differentially private data sets. Similarly, Facebook scientists have just posted a preprint with guidelines on creating such data sets.
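The measurement-error problem can be illustrated with a small simulation. The sketch below is not the method from the Evans and King paper, whose details the article does not give. It is the classical errors-in-variables correction for a single predictor, under the assumption that the analyst knows the variance of the added noise. The data, the slope, the noise level, and the function names are all made up for illustration.

```python
import random


def simulate(n, true_slope, noise_sd, rng):
    # Generate y from a clean x, then release only a noisy copy of x.
    xs_noisy, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        ys.append(true_slope * x + rng.gauss(0.0, 0.5))
        xs_noisy.append(x + rng.gauss(0.0, noise_sd))
    return xs_noisy, ys


def slopes(xs, ys, noise_sd):
    # Return (naive, corrected) least-squares slopes of y on noisy x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    var = sum((x - mx) ** 2 for x in xs) / (n - 1)
    naive = cov / var
    # Subtract the known noise variance to undo the attenuation bias.
    corrected = cov / (var - noise_sd ** 2)
    return naive, corrected


rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs, ys = simulate(n=20_000, true_slope=2.0, noise_sd=1.0, rng=rng)
naive, corrected = slopes(xs, ys, noise_sd=1.0)
print(f"naive slope: {naive:.2f}, corrected slope: {corrected:.2f}")
```

In this setup the naive estimate is pulled toward zero, to roughly half the true slope of 2, while the corrected estimate lands near 2. Real differentially private releases also involve censoring and noise in several variables at once, which this sketch ignores.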
Tucker says scientists need to be convinced that their analyses are correct before the community will embrace the new approach to privacy. “We need the opportunity to validate that the results with differential privacy are close to those from tables” derived using previous ways to safeguard privacy, he says. “It all comes down to building a sense of trust.”