Jim Pitman, Professor of Statistics and Mathematics, University of California, Berkeley, writes:

A recent letter to the IMS Bulletin Editor, “How a nonexistent publication can have 20 citations,” by Czesław Stępniak, draws attention to the increasing importance of author identity in the online universe. The importance of correct author identification has been widely recognized: to allow authors due credit for their work, to assist researchers in navigating the vast universe of bibliographic data, and to facilitate collaborations between authors with similar or complementary interests.

ORCID is a non-profit organization formed in August 2010, with the aim of solving the author/contributor name ambiguity problem in scholarly communications by creating a central registry of unique identifiers for individual researchers, and an open and transparent linking mechanism between ORCID and other current author ID schemes.

The initiative to form ORCID came from major commercial interests in scientific publishing and information management: Thomson Reuters, Elsevier, and the Nature Publishing Company. To their credit, these publishers recognized that to create a widely adopted author identifier system it was essential to engage other participants in scholarly communication. The 14 directors of ORCID now include representatives from major universities and library organizations (Cornell, Harvard, MIT, OCLC), the Wellcome Trust (a funding organization with a strong commitment to open access), and both commercial and non-commercial publishers. One notable participant is CERN, which in February 2010 published its entire book catalog under a CC0 license, which dedicates the data to the public domain. As Jens Vigen, head of the CERN Library, said:

“Books should only be catalogued once. Currently the public purse pays for having the same book catalogued over and over again. Librarians should act as they preach: data sets created through public funding should be made freely available to anyone interested.”

The same should be said of records for articles, journals, authors, subjects, and so on, which constitute the bibliographic universe now available to scholars through a growing variety of academic information services. This is the idea behind the Principles for Open Bibliographic Data developed by the Open Knowledge Foundation. See also the recent article Open Bibliography for Science, Technology, and Medicine.

The creation of ORCID signals a radical change in the governance of bibliographic data. Major publishers are collaborating with major universities and supporters of open access initiatives to develop an open indexing system for authors. Within a few years, such a system, spanning all fields, with institutional backing from both publishers and universities, will likely replace the patchwork of closed systems developed and maintained by subject-specific scholarly societies, such as the AMS MathSciNet Authors Database and ACM Author Profile Pages. These closed systems are unsustainable, and will come under strong pressure in the next few years to open up their data. No matter how carefully curated they may be, or how profitable to their maintaining societies, these databases artificially separate the communities they enclose from the larger scholarly universe inhabited by researchers, many of them statisticians, who work across a spectrum of mathematical science, computer science, and various domain sciences – biology, astronomy, and so on – all with their own indexing systems which are currently incapable of exchanging data with each other.

The breaking down of walls separating the bibliographic assets of various disciplines has been accelerated by the emergence of Google Scholar and Microsoft Academic Search. These semi-open academic search tools now compete with the major subscription indexing services in providing citation indexing across all fields. It is notable that Microsoft Research is a participant in the ORCID consortium, but Google is not. Google does not need to play with ORCID: Google has its own authentication mechanisms which it is now leveraging to create Google Scholar Citations. See for instance http://scholar.google.com/citations?user=cH0pbIwAAAAJ for my Google Scholar Citation Profile. The user value “cH0pbIwAAAAJ” in the above URL is a unique author identifier for me which has been credibly established by Google through a verified email address at stat.berkeley.edu.
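
To make the role of the identifier concrete, here is a minimal sketch of how a script might read the user value out of a profile URL of this shape. The function name scholar_user_id is hypothetical, and the sketch assumes only that the identifier is carried in the "user" query parameter, as in the URL quoted above. It does not contact Google or rely on any Google API.

```python
from urllib.parse import urlparse, parse_qs


def scholar_user_id(profile_url):
    """Return the value of the 'user' query parameter, or None if absent.

    Hypothetical helper: it only parses the URL string and performs
    no network access.
    """
    query = parse_qs(urlparse(profile_url).query)
    values = query.get("user")
    return values[0] if values else None


profile_url = "http://scholar.google.com/citations?user=cH0pbIwAAAAJ"
user_id = scholar_user_id(profile_url)
print(user_id)
```

A value extracted this way could serve as a key when linking a personal web page to the profile, though what the identifier means and how long it persists are entirely up to Google.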

It will take some time to determine which agent or coalition of agents in the scholarly communication game will end up providing the most reliable and useful author identification system. But publishers and universities between them have all the information necessary to identify and track authors and their associated bibliographic data. So, too, do individual scholars together with either Google or Microsoft. Most individual scholars and universities have no stake in selling bibliographic metadata, and little interest in buying it if it can be obtained for free. It is therefore in their interest to give away their bibliographic metadata in machine-readable formats (JSON, XML, RDF, …), which are now easily generated from raw bibliographic data. Such metadata serves as an advertisement for the real product: the research output of the individual or university expressed in the full text provided by publishers.

Collaboration between publishers and universities through ORCID, and trans-disciplinary indexing services such as Google Scholar and Microsoft Academic Search, should be expected to disrupt traditional subject-specific abstracting and indexing services. An emerging environment of open bibliographic services can be expected to pressure the traditional services to reinvent themselves as open data providers, doubtless with some additional component (reviews, annotations, ratings, data visualizations, etc.) to protect their subscription business model, which will not be given up easily. To stay afloat in a rising sea of open bibliographic data, traditional subscription services will need to facilitate the following scenario, which is still typically inhibited by excessive licensing restrictions. An individual, university department, or research organization subscribing to the service should be able to process and post feeds to the web of personal or subject bibliographic collections derived from the service, comparable to the way an author can now derive their Google Scholar Profile from the ocean of Google Scholar data. Such collections should be made available as structured data (including author and subject identifiers) under a CC0 or similar license that permits remixing and reuse without permission. For instance, an author working in multiple fields should be able to pull a bibliographic data feed from some service in each field, and merge these feeds with data provided by Google Scholar and/or Microsoft, to come up with their own personal display of their bibliographic assets. Tools for this purpose are provided by the BibServer software and BibJSON format for bibliographic data which I developed over the last few years, supported at times by IMS and NSF. This software and underlying data format are now being further developed as open source projects with support from JISC via the Open Biblio 2 project. Another suitable tool is Harvard Open Scholar, a full-featured open-source web site-creation package designed for the academic community. Tools such as these empower individuals, research groups, departments and universities to take back and curate their bibliographic data and become full participants in an open bibliographic ecosystem.
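
The feed-merging scenario described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BibServer implementation or the BibJSON specification. The record fields ("title", "year", "doi", "subject"), the sample records, and the function names record_key and merge_feeds are all hypothetical. Records are matched on DOI when one is present, and otherwise on normalized title plus year.

```python
import json


def record_key(record):
    """Build a matching key: the DOI if present, else normalized title plus year."""
    doi = record.get("doi")
    if doi:
        return ("doi", doi.strip().lower())
    title = " ".join(record.get("title", "").lower().split())
    return ("title-year", title, str(record.get("year", "")))


def merge_feeds(*feeds):
    """Merge several lists of records, combining the fields of matching records."""
    merged = {}
    for feed in feeds:
        for record in feed:
            combined = merged.setdefault(record_key(record), {})
            for field, value in record.items():
                # On conflict, the value from the earlier feed is kept.
                combined.setdefault(field, value)
    return list(merged.values())


# Made-up sample feeds standing in for two subject-specific services.
math_feed = [{"title": "An Example Paper", "year": 2010, "doi": "10.0000/example.1"}]
stats_feed = [
    {"title": "An example paper", "year": 2010, "doi": "10.0000/EXAMPLE.1",
     "subject": "statistics"},
    {"title": "Another Example Paper", "year": 2011},
]

personal = merge_feeds(math_feed, stats_feed)
print(json.dumps(personal, indent=2))
```

The matching rule is deliberately simple. A record that carries a DOI in one feed but not in another will not be merged, which is exactly the kind of gap that shared author and document identifiers are meant to close.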

It remains to be seen what agent or agents will end up providing the best service for individual researchers, departments or universities to display the bibliographic data they generate, and how best such data will be aggregated for search and discovery. But don’t wait for the big publishers and information brokers to monopolize this function. You can push the publishers and subscription services to support open access to your research work and open bibliographic data, by taking as many of the following steps as you can:

1. Post all new articles to arXiv at the same time you submit to a journal. After the article is refereed, revise the post with the final form sent to the publisher.

2. Ask your university library to help support arXiv, and to push indexing services to grant rights to republish selected data to the web under a CC0 or similar license.

3. Make a Google Scholar Citations profile for yourself, and link to it from your website.

4. Maintain on your website a comprehensive list of your publications, preferably in a machine-readable format such as BibTeX, BibJSON or logical equivalent, using open bibliographic management and display software such as Open Scholar or BibServer. Best of all, include abstracts or summaries, which arguably can be posted online under the US “fair use” doctrine, even if the article itself is copyrighted by a publisher.

5. If legally possible, provide open access copies of the full text of your work, best of all on arXiv or an institutional repository, and second best as files beneath your homepage.

6. Register your website address with any open author identification service that is willing to take it, such as Thomas Krichel’s AuthorClaim Service or Google Scholar Citations, to make it easy for researchers to navigate back and forth between researcher websites and whatever author indexing capability develops from ORCID and other author identification systems.

7. Contact representatives of scholarly societies you may belong to, and encourage them to support modern web services to facilitate integration of bibliographic data they provide into personal and departmental web displays.

8. Ask your university library to register as a participant in ORCID, and to push ORCID to support open bibliographic data services.

9. Contact me if you are interested in assisting further development of open bibliographic data services in probability and statistics or related fields.
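
As an illustration of step 4, a machine-readable publication list can be generated from simple records with very little code. The sketch below writes a BibTeX @article entry from a Python dictionary. The function name to_bibtex, the citation key, and the sample record are hypothetical, and the sketch does not escape special LaTeX characters, which a production tool would need to handle.

```python
def to_bibtex(key, record):
    """Format one article record as a BibTeX @article entry.

    Only a fixed set of common fields is written, and values are
    inserted as-is with no escaping of special characters.
    """
    fields = ["author", "title", "journal", "year"]
    lines = ["@article{%s," % key]
    for name in fields:
        if name in record:
            lines.append("  %s = {%s}," % (name, record[name]))
    lines.append("}")
    return "\n".join(lines)


# Made-up sample record for illustration only.
entry = to_bibtex("doe2010example", {
    "author": "Doe, Jane",
    "title": "An Example Paper",
    "journal": "Journal of Examples",
    "year": 2010,
})
print(entry)
```

Posting a file of such entries alongside the human-readable list lets indexing services and colleagues import the records directly.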