Open Citations
Make bibliographic citation data freely available and substantial
benefits will flow, says David Shotton, director of the Open Citations Corpus.
[This is the full text of the Nature Comment article
David Shotton (2013). Open citations. Nature, 502 (7471): 295-297. http://dx.doi.org/10.1038/502295a
to which David Shotton retains copyright.]
When Heather Piwowar set out in May last year to
investigate whether making research data publicly available increased the citation rates of articles1,
she never anticipated the difficulties. Piwowar, co-founder of Impact Story2, based in
Vancouver, Canada, was at the time a post-doc at Duke University, North
Carolina. Lacking institutional access to Scopus, Elsevier's database of
scholarly citations, she eventually
obtained access through a Research Worker agreement with Canada's National
Research Library after being fingerprinted to obtain a police clearance certificate because
she had lived in the United States. "It was just ridiculous - for Scopus data! I wasted days trying to access the citation
data required for my study," she told me.
She needed to analyse citation counts for ten thousand articles, but the
other major citation source, Thomson Reuters' Web of Science, did not at that
time support PubMed ID queries. She
explains: "Had there been open citation data, I could have written my own
script!"
Steven
Greenberg, a neurologist at Harvard Medical School in Boston, Massachusetts, had
a similar experience when he set about revealing how hypotheses can be converted
into ‘facts’ simply by repeated citation3. Greenberg had to construct and analyse by hand a citation
network containing 242 papers, 675 citations and 220,553 distinct citation
paths relevant to a particular hypothesis. Had those citation data been readily
accessible online, he would have been saved considerable effort.
Research
practice suffers because access to citation data is currently so
difficult. In this Open Access age, it is a scandal
that reference lists from journal articles, core elements of scholarly
communication that permit the attribution of credit and
integrate our independent research endeavours, are not readily and freely available for use by all scholars.
To rectify this, citation data now need to be recognized as a part of the Commons – those works that are freely and legally available for sharing – and placed in an open repository. To that end, since 2010 I have led a project, funded by two small grants totalling £132,000 (US$212,000) from Jisc (http://www.jisc.ac.uk), a UK information technology research and development funding organization, to establish and develop the Open Citations Corpus (OCC) (http://opencitations.net). The OCC is a fledgling repository for open scholarly citation data that is now seeking sustainable funding to become a cornerstone of the digital research infrastructure that supports the academic enterprise.
Closed shop
Although alternative
metrics of impact and esteem are being developed4, direct
citation remains
a keystone indicator of the significance of an output (as discussed by
Mark
Hahnel on page 298). Scholarly communication involves the flow of information and
ideas through the citation network, and analysis of changes in the
network over time can reveal patterns of communication between scholars and the
development and demise of academic disciplines. Such information is central to scholarly
endeavour. It is also fundamental to good decision making about research
investment and strategy, to facilitate
innovation, and to promote growth and prosperity, particularly in light of the
increasingly international nature of research collaborations5.
The most
authoritative sources of scholarly citation data are the Thomson Reuters Web of Science,
which grew from the Science Citation Index created by US scientist Eugene
Garfield in 1964 and which was originally published by the Institute for
Scientific Information (ISI); and its main commercial rival, Elsevier's Scopus,
released in 2004. Both have wide
coverage of the leading scholarly literature, but, because neither is complete, they are widely regarded as complementary6. For access to these two resources, UK
research universities each pay tens of thousands of pounds a year6, with equivalent sums being charged in other developed countries.
The exact values of these subscriptions are closely guarded industrial secrets,
and the university librarians who pay these fees are bound by confidentiality
agreements from disclosing them. This high cost severely disadvantages all
those who work outside such wealthy institutions, including most businesses and
the general public. The two other
significant sources of
citation information, also run by commercial companies but accessible without
subscriptions, are Google Scholar and Microsoft Academic Search, released in
2004 and 2009, respectively. Google Scholar's coverage is wider than that of
the others, because it includes books, theses, preprints, technical reports and
other non-peer-reviewed 'grey' literature.
All
these sources have licence restrictions that prevent the republication of their
citation data. For this reason, bibliometrics papers are rarely permitted to publish the
data upon which their conclusions are based, hampering re-use, validation of
findings, and other advantages of open data.
Worse, the available citation data are not accurate! My own citation record differs considerably between Web of Science, Scopus, Google Scholar and Microsoft Academic Search. For example, a 2009 paper on semantic publishing7 that I co-authored currently has citation counts of 22, 37, 88 and 16, respectively, in these four databases. Which to trust? More worryingly, an earlier protein crystallography paper8 has three separate entries in Web of Science, with citation counts of 59, 19 and 0, respectively, for this single publication! In my view, this calls into question the reliability of the Thomson Reuters Impact Factor, which is based on such counts.
A solution
The Open Citations Corpus (OCC), a new open repository of scholarly citation data made available under a Creative Commons CC0 1.0 public domain dedication, is attempting to improve matters. It aims to provide accurate citation data that others may freely build upon, enhance and reuse for any purpose, without restriction under copyright or database law. We will publish the bibliographic citations from scholarly journal articles as Linked Open Data, making citation links as easy to traverse as Web links.
We began building the OCC in mid-2010 and released the first version in
mid-2011. This prototype provided open access to reference lists
from the 204,637 articles that then comprised the Open Access Subset of PubMed
Central (OA-PMC), containing 6,325,178 individual references to 3,373,961
unique papers. Despite its small size, this
corpus contains references to about 20% of all the biomedical literature
indexed in PubMed that was published between 1950 and 2010, including all
the highly cited papers in every biomedical field. Available at http://opencitations.net/, the OCC is structured to
enable the information to be easily
integrated with similar information from elsewhere — the data are Linked Open Data using the SPAR (Semantic Publishing and Referencing) Ontologies9 and the latest Semantic Web standards.
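To illustrate what publishing citations as Linked Open Data means in practice, here is a minimal sketch (not the OCC's actual pipeline) of how a single article-to-article citation can be expressed as an RDF triple using the `cites` property from the CiTO ontology in the SPAR suite. The two DOIs are hypothetical examples.

```python
# CiTO's cito:cites property URI, from the SPAR ontologies.
CITO_CITES = "http://purl.org/spar/cito/cites"

def citation_triple(citing_doi: str, cited_doi: str) -> str:
    """Serialize one article-to-article citation as an N-Triples line,
    identifying each article by its DOI-based URI."""
    return (f"<http://dx.doi.org/{citing_doi}> "
            f"<{CITO_CITES}> "
            f"<http://dx.doi.org/{cited_doi}> .")

triple = citation_triple("10.1000/example.1", "10.1000/example.2")
print(triple)
```

Because each article is named by a globally resolvable URI, triples like this from different sources can be merged into one graph without any record matching, which is precisely what makes Linked Open Data suited to integrating citation data from elsewhere.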
Other open citations resources also exist. The two main ones are CiteSeerX (http://citeseerx.ist.psu.edu/), which contains ~13,500,000 references from 1,230,150 articles primarily in computer science, and CitEc (Citations in Economics; http://citec.repec.org/) which contains 13,544,970 references from 545,641 documents. Together, these resources and the OCC have the references from ~1,980,000 articles, a mere 4% of the estimated 50 million articles that have been published.
We are currently revising the OCC data model, improving its hosting infrastructure, and expanding its coverage, both by updating the OA-PMC holdings, which have more than doubled since the initial ingest to 671,004 articles, and by ingesting citation data from the 879,431 preprints in the arXiv preprint server, thus adding citations in mathematics and the 'hard' sciences to augment the initial biomedical coverage. Future work will include integration with CiteSeerX, harvesting dataset-to-article references from the Dryad Data Repository, and extracting references from the pre-digital 'legacy' literature that is poorly represented in other citation repositories. This applies particularly in fields in which such literature is both well organized and of enduring value – notably astronomy, biodiversity and biological taxonomy.
Ideally, references will come directly from publishers at the time of article publication. Most publishers are sympathetic to the idea of putting article reference lists outside the journal subscription paywall, as they do for copyrighted abstracts. We already have agreement with several major journal publishers for the future routine harvesting of reference data. As well as the 'pure' Open Access publishers, the references of which are open by definition, the publishers of subscription-access journals include Nature Publishing Group, Oxford University Press, the American Association for the Advancement of Science (which publishes Science), Royal Society Publishing, Portland Press, MIT Press and Taylor & Francis, all of which will make references available from some or all of their journals. This represents a small but growing proportion of all the journal articles published in a year.
References will be harvested centrally from CrossRef, the organization that provides digital object identifiers for journal articles, to which these publishers already submit article reference lists for use by its CitedBy Linking service. However, publishers need to indicate their consent in the article metadata for these references to be made open (see http://blog.crossref.org/2016/06/distributing-references-via-crossref.html), because by default references are kept private. No other action is required; it is straightforward and free.
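For publishers, the consent signal is a single attribute in the metadata they already deposit with CrossRef. The fragment below is a sketch based on CrossRef's deposit schema; the title, DOIs and resource URL are hypothetical, and `reference_distribution_opts="any"` is the setting that marks an article's references as openly distributable.

```xml
<!-- Sketch of the relevant part of a CrossRef deposit (hypothetical
     values). Setting reference_distribution_opts="any" on the article
     makes its deposited references openly available. -->
<journal_article publication_type="full_text"
                 reference_distribution_opts="any">
  <titles><title>Example article title</title></titles>
  <doi_data>
    <doi>10.1000/example.1</doi>
    <resource>http://www.example.org/article1</resource>
  </doi_data>
  <citation_list>
    <citation key="ref1">
      <doi>10.1000/example.2</doi>
    </citation>
  </citation_list>
</journal_article>
```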
The long-term aim of the OCC is to host citation information for most of the world's scholarly literature, in the arts and humanities as well as the sciences. This will require a major curatorial effort and underpinning technical innovation, on the scale of PubMed, which is run by the US National Library of Medicine.
Open season
In an ideal
world, publishers would host their own bibliographic and citation data,
following the example of NPG (publishers of this journal) — the first and
currently only company to make such information available as Linked Open Data,
at data.nature.com.
But separate benefits flow from the aggregation of such data into a single corpus. The OCC will provide integrated access to citation data from a
variety of sources, both inside and outside traditional scholarly publishing, with clear provenance data.
It will expose entity relationships, including article-to-article, article-to-database and database-to-article citations, and will reveal shared authorship and institutional membership, common funding, and semantic relationships between articles, where the data are available.
Once citation data are openly available, useful analytical services can be built over them, including faceted search and browse tools, recommendation and trend identification services, and timeline visualization. Some of these we have already developed in prototype. The OCC's usefulness for calculating citation metrics will, of course, increase in proportion to its expanding coverage.
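As a toy illustration of the simplest such analytical service, the sketch below counts inbound citations and ranks papers from a list of citing–cited pairs; the DOIs and the in-memory list are hypothetical stand-ins for a query over a real open corpus.

```python
from collections import Counter

# Hypothetical citing -> cited pairs, as might be retrieved from an
# open citation repository.
citations = [
    ("10.1000/a", "10.1000/x"),
    ("10.1000/b", "10.1000/x"),
    ("10.1000/b", "10.1000/y"),
    ("10.1000/c", "10.1000/x"),
]

# Count inbound citations per cited paper and rank by count.
inbound = Counter(cited for _citing, cited in citations)
top_cited = inbound.most_common()

print(top_cited)  # [('10.1000/x', 3), ('10.1000/y', 1)]
```

With open data, writing such a script takes minutes – the point Piwowar makes above – and richer services (faceted browsing, trend detection, timelines) are elaborations of the same freely repeatable aggregation.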
There is one additional service that we envisage could be of particular benefit to authors and editors – an erroneous reference correction service. About 1% of references in published papers contain errors of varying severity, ranging from the trivial — for example, substitution of ‘beta amylase’ for ‘β-amylase’ in the reference title, or the omission of accents in author names — to the more serious, such as errors in the year, volume or page numbers. The OCC already uses citation correction methods internally for reference targets that are multiply cited, or for which authoritative bibliographic records can be obtained externally. A similar Web service that could detect errors in uploaded reference lists might significantly reduce the number of mistakes in published papers.
Help us!
So what next? Just over a decade ago, a similar aim for open citation data was held by the Open Citation (OpCit) Linking Project (http://opcit.eprints.org/), a collaboration between Southampton University, UK; Cornell University in Ithaca, New York; and arXiv, which ran between 1999 and 2002. That project developed Citebase, a citation database that its developers described as "the crown jewel of the Open Citation Project". Following the link to http://citebase.eprints.org/ today, one gets the message "No website currently exists at this URL."
Making the transition from a promising academic project to a robust
sustainable global service is extremely difficult. For the Open Citations Corpus to avoid the
fate of Citebase, and instead grow into a comprehensive
and trustworthy source of well-curated open citation data serving the entire
scholarly community across all disciplines, it requires champions, managers,
developers and curators. It also needs
genuine collaborations with similar endeavours, a sustained and sizeable income
stream from funders, supporters and investors committed to achieving a social
good rather than a financial return, direct support from the publishing
community, and adoption by a major institution or international
organization. Can you help?
David Shotton is Director of the Open Citations Corpus and
Senior Research Fellow in the Oxford e-Research Centre, University of Oxford,
UK. e-mail: david.shotton@oerc.ox.ac.uk.
References
[1] Piwowar, H. A. & Vision, T. J. PeerJ 1, e175 (2013). http://dx.doi.org/10.7717/peerj.175
[2] Piwowar, H. Nature 493, 159 (2013). http://dx.doi.org/10.1038/493159a
[3] Greenberg, S. A. BMJ 339, b2680 (2009). http://dx.doi.org/10.1136/bmj.b2680
[4] Priem, J. Nature 495, 437–440 (2013). http://dx.doi.org/10.1038/495437a
[5] Adams, J. Nature 490, 335–336 (2012). http://dx.doi.org/10.1038/490335a
[6] Chadegani, A. A. et al. Asian Social Sci. 9, 18–26 (2013). https://arxiv.org/pdf/1305.0377.pdf
[7] Shotton, D. et al. PLoS Computational Biology 5, e1000361 (2009). http://dx.doi.org/10.1371/journal.pcbi.1000361
[8] Shotton, D. M. et al. Cold Spring Harbor Symposia on Quantitative Biology 36, 91–105 (1972). http://symposium.cshlp.org/content/36/91.short
[9] Peroni, S. & Shotton, D. Web Semantics: Science, Services and Agents on the World Wide Web 17, 33–34 (2012). http://dx.doi.org/10.1016/j.websem.2012.08.001