When it comes to searching for scientific literature, Google Scholar has become a go-to resource for a growing number of researchers. The powerful academic search engine seems to comb through every academic study in existence. But figuring out exactly how many papers are covered by Google Scholar isn’t easy, recent research shows, in part because of the company’s tight-lipped nature. And some scholars warn the service may be inflating citation counts, although that may not necessarily be a bad thing.
Figuring out how many documents are indexed in traditional bibliographic databases, such as Thomson Reuters’s Web of Science and Elsevier’s Scopus, is a piece of cake—a simple query is all it takes. Microsoft Academic Search is similarly transparent. Google Scholar, however, offers no such tools to bibliometric researchers, and the Web search giant has declined to publish the information.
To come up with a tally, bibliometricist Enrique Orduña-Malea of the Polytechnic University of Valencia in Spain and his colleagues used four different methods to estimate Google Scholar’s total number of documents. Although each method has distinct limitations, all but one yield similar results, the researchers report in a study posted to the arXiv preprint server earlier this year and updated this month. The number: 160 million indexed documents (plus or minus 10%), including journal articles, books, case law, and patents.
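For readers who want the reported margin as absolute numbers, the short sketch below converts "160 million, plus or minus 10%" into a range. It assumes the 10% is a symmetric relative error on the point estimate, which is one plausible reading of the figure; the function and variable names are illustrative and do not come from the study.

```python
# Point estimate (in millions of documents) and relative margin,
# both taken from the figures quoted in the article.
ESTIMATE_MILLIONS = 160
RELATIVE_ERROR = 0.10  # assumption: symmetric +/- 10% around the estimate


def estimate_bounds(estimate, relative_error):
    """Return (low, high) bounds for a symmetric relative error."""
    margin = estimate * relative_error
    return estimate - margin, estimate + margin


low, high = estimate_bounds(ESTIMATE_MILLIONS, RELATIVE_ERROR)
print(f"Implied range: {low:.0f} to {high:.0f} million documents")
```

Under that reading, the estimate spans roughly 144 million to 176 million indexed documents.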
The study is a very “thorough and creative work,” says bibliometricist Henk Moed, a visiting professor at Sapienza University of Rome and a former senior scientific adviser at Elsevier, who was not involved with the research.
By itself, however, the number doesn’t answer some other questions important to academics. One is: What proportion of all scholarly documents is covered by Google Scholar? A previous study by computer scientists Madian Khabsa and C. Lee Giles of Pennsylvania State University, University Park, which estimated the size of Google Scholar at 100 million documents, suggested that it covers about 88% of all scholarly documents accessible on the Web in English. “It’s not complete, but a very good coverage,” Giles says.
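The two figures quoted from the earlier study also imply a total. The sketch below is a back-of-the-envelope calculation from the numbers in this article only, not the method Khabsa and Giles used; the variable names are illustrative.

```python
# Figures quoted in the article from the Khabsa and Giles study.
SCHOLAR_SIZE_MILLIONS = 100  # estimated size of Google Scholar
COVERAGE_FRACTION = 0.88     # share of English-language scholarly documents on the Web

# If 100 million documents make up 88% of the total, the total is 100 / 0.88.
implied_total_millions = SCHOLAR_SIZE_MILLIONS / COVERAGE_FRACTION
not_covered_millions = implied_total_millions - SCHOLAR_SIZE_MILLIONS

print(f"Implied total: about {implied_total_millions:.0f} million documents")
print(f"Not covered: about {not_covered_millions:.0f} million documents")
```

Taken at face value, those figures put the English-language scholarly Web at roughly 114 million documents, with about 14 million outside Google Scholar’s index at the time of that study.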
Another puzzle is how to gauge the quality of Google Scholar’s citation statistics (the number of times a paper is cited by other authors). In general, the citation numbers on Google Scholar tend to be higher than those provided by other sources, Moed says. That’s apparently because databases such as Web of Science require their citation sources to be peer-reviewed and surpass a minimum impact factor, he says, whereas Google Scholar taps into a much broader range of sources.
That’s not necessarily bad, Orduña-Malea says, noting that counting citations from such a broad range of sources can help expand the range of papers that researchers read, beyond those published in elite journals. But a potential problem, Moed says, is that some people may think that “the higher the [citation] numbers, the better the database. … But this is not necessarily the case.”
Moed’s team is conducting a study that compares citations for the same paper provided by Google Scholar and Scopus. One goal is to discern the sources included in Google’s search engine, he says. So far, the percentage of peer-reviewed sources tapped by Google appears to vary drastically across disciplines, he says.
To take the mystery out of such research, both Moed and Orduña-Malea would like to see Google Scholar become more transparent. Indeed, Google’s silence on the size of its index made the authors of the arXiv study wonder “if the company really knows this figure.”
But one insider dismisses such musings. “It is of course not difficult to compute,” says Anurag Acharya, a co-founder of Google Scholar who leads its development. Still, he wouldn’t share any numbers with ScienceInsider. He did note, however, that index size is less relevant to search companies such as Google than to subscription-based databases, which can use size as a selling point.
One thing seems certain, researchers say: Google Scholar is continuing to expand its coverage of scholarly literature, which is already believed to be the largest among all academic search engines and databases. “Google Scholar, we think, is representing very well the science landscape,” Orduña-Malea says.