

Real-time Stream of DOIs being cited in Wikipedia

TL;DR

Watch a real-time stream of DOIs being cited (and “un-cited”!) in Wikipedia articles across the world: https://0-live-eventdata-crossref-org.libus.csd.mu.edu/live.html

Background

For years we’ve known that the Wikipedia was a major referrer of Crossref DOIs, and about a year ago we confirmed that, in fact, the Wikipedia is the 8th largest referrer of Crossref DOIs. We know that people follow the DOIs, too. This is despite only a fraction of Wikipedia citations to the scholarly literature using DOIs at all. So back in August we decided to create a Wikimedia Ambassador programme. The goal of the programme was to promote the use of persistent identifiers in citation and attribution in Wikipedia articles. We would do this through outreach and through the development of better citation-related tools.

Remember when we originally wrote about our experiments with the PLOS ALM code and how that has transitioned into the DOI Event Tracking Pilot? In those posts we mentioned that one of the hurdles in gathering information about DOI events is the actual process of polling third party APIs for activity related to millions of DOIs. Most parties simply wouldn’t be willing to handle the load of 100K API calls an hour. Besides, polling is a tremendously inefficient process. Only a fraction of DOIs are ever going to generate events, but we’d have to poll for each of them, repeatedly, forever, to get an accurate picture of DOI activity. We needed a better way. We needed to see if we could reverse this process and convince some parties to instead “push” us information whenever they saw DOI-related events (e.g. citations, downloads, shares, etc.). If only we could convince somebody to try this…

Wikipedia DOI Events

In December 2014 we took the opportunity of the 2014 PLOS/Crossref ALM Workshop in San Francisco to meet with Max Klein and Anthony Di Franco, where we kicked off a very exciting project.

There’s always someone editing a Wikipedia somewhere in the world. In fact, you can see a dizzying live stream of edits. We thought that given that there are so many DOIs in Wikipedia, that live stream may contain some diamonds (DOIs are made of diamond, that’s how they can be persistent). Max and Anthony went away and came back with a demo that contains a surprising amount of DOI activity.

That demo is evolving into a concrete service, called Cocytus. It is running at Wikimedia Labs monitoring live edits as you read this.

For now we’re feeding that data into the DOI Events Collection app (which is an off-shoot of the Chronograph project). We are in the process of modifying the Lagotto code so that we can instead push those events into the DOI Event Tracking Instance.

The first DOI event we noticed was delightfully prosaic: The DOI for “The polymath project” is cited by the Wikipedia page for “Polymath Project”. Prosaic perhaps, but the authors of that paper probably want to know. Maybe they can help edit the page.

Or how about this. Someone wrote a paper about why people edit Wikipedia and then it was cited by Wikipedia. And then the citation was removed. The plot thickens…

We’re interested in seeing how DOIs are used outside of the formal scholarly literature. What does that mean? We don’t fully know, that’s the point. We have retractions in scholarly literature (and our Crossmark metadata and service allow publishers to record that), but it’s a bit different on Wikipedia. Edit wars are fought over … well you can see for yourself.

Citations can slip in and out of articles. We saw the DOI 10.1001/archpediatrics.2011.832 deleted from “Bipolar disorder in children”. If we’d not been monitoring the live feed (we had considered analysing snapshots of the Wikipedia in bulk) we might never have seen that. This is part of what non-traditional citations means, and it wasn’t obvious until we’d seen it.
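To make the “cited” and “un-cited” idea concrete, here is a minimal sketch of how two revisions of an article could be compared. It is not the Cocytus code: the DOI pattern and the function names are our own assumptions for illustration, and a real DOI matcher needs to be more careful than this.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
# This is an illustrative assumption, not the pattern Cocytus uses.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s|}\]<>\"]+")


def extract_dois(wikitext):
    """Return the set of DOI-like strings found in a piece of wikitext."""
    # Strip trailing punctuation that often follows a DOI in running text.
    return {match.rstrip(".,;") for match in DOI_PATTERN.findall(wikitext)}


def diff_dois(old_wikitext, new_wikitext):
    """Compare two revisions and report DOIs that were added or removed."""
    old, new = extract_dois(old_wikitext), extract_dois(new_wikitext)
    return {"cited": new - old, "uncited": old - new}
```

Run against the text before and after an edit, a DOI that disappears shows up under "uncited", which is exactly the kind of event a bulk snapshot would miss.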

You can see this activity on the Chronograph’s stream. Or check your favourite DOI. Please be aware that we’re only collecting newly added citations as of today. We do intend to go back and back-fill, but that may take some time, as it *cough* requires polling again.

Some Technical Things

A few interesting things that happened as a result of all this:

Secure URLs

SSL and HTTPS were invented so you could do things like banking on the web without fear of interception or tampering. As the web becomes a more important part of life, many sites are upgrading from HTTP to HTTPS, the secure version. This is not only because your confidential details may be tampered with, but because certain governments might not like you reading certain materials.

Because of this, Wikipedia decided last year to embark on an upgrade to HTTPS, and they are a certain way along the path. The IDF, who are responsible for running the DOI system, upgraded to HTTPS this Summer, although most DOIs are still referred to by HTTP.

We met with Dario Taraborelli at the ALM workshop and discussed the DOI referral data that is fed into the Chronograph. We put two and two together and realised that Wikipedia was linking to DOIs (which are mostly HTTP) from pages which might be served over HTTPS. New policies in HTML5 specify that referrer URL headers shouldn’t be sent from HTTPS to HTTP (in case there was something secret in them). The upshot of this is that if someone is browsing Wikipedia via HTTPS and clicks on a normal DOI, we won’t know that the user came from Wikipedia. Not a huge problem today, but as Wikipedia switches over to entirely secure, we’re going to miss out on very useful information.

Fortunately, the HTML5 specification includes a way to fix this (without leaking sensitive information). We discussed this with Dario, and he did some research, and came up with a suggestion, which got discussed. It’s fascinating to watch a democratic process like this take place and take part in it.
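For the curious, the referrer behaviour can be sketched in a few lines. This models just two referrer policies, the default "no-referrer-when-downgrade" and "origin", and simplifies both. The function name and the example URLs are ours, and we are assuming, not reporting, that an origin-only policy is the kind of fix under discussion.

```python
from urllib.parse import urlsplit


def referrer_sent(from_url, to_url, policy="no-referrer-when-downgrade"):
    """Return the Referer value a browser would send, or None if omitted.

    Only two policies are modelled; this is a simplification of the
    referrer policy rules, not a full implementation.
    """
    source, target = urlsplit(from_url), urlsplit(to_url)
    downgrade = source.scheme == "https" and target.scheme != "https"
    if policy == "no-referrer-when-downgrade":
        # Default behaviour: drop the referrer when going HTTPS -> HTTP.
        return None if downgrade else from_url
    if policy == "origin":
        # Send only scheme and host, never the path, even on a downgrade.
        return f"{source.scheme}://{source.netloc}/"
    raise ValueError(f"unsupported policy: {policy}")
```

With the default policy, a click from an HTTPS Wikipedia page to an HTTP DOI link yields no referrer at all. With "origin", the resolver would learn that the click came from Wikipedia but not which article, which is the “without leaking sensitive information” part.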

We’re waiting to see how the discussion turns out, and hope that it all works out so we can continue to report on how amazing Wikipedia is at sending people to scholarly literature.

How shall I cite thee?

Another discussion grew out of that process, and we started talking to a Wikipedian called Nemo (note to Latin scholars: we weren’t just talking to ourselves). Nemo (real name Federico Leva) had a few suggestions of his own. Another way to solve the referrer problem is by using HTTPS URLs (HTML5 allows browsers to send the referrer domain when going from HTTPS to HTTPS).

This means going back to all the articles that use DOIs and changing them from HTTP to HTTPS. Not as simple as it sounds, and it doesn’t sound simple. We started looking into how DOIs were cited on Wikipedia.
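For plain URL citations, the change itself is mechanical. A rough sketch, assuming the links point at the dx.doi.org or doi.org resolver hostnames (our assumption; the function name is also ours):

```python
import re

# Assumed resolver hostnames; other forms of DOI link would need more patterns.
HTTP_DOI_LINK = re.compile(r"http://((?:dx\.)?doi\.org/10\.)")


def upgrade_doi_links(wikitext):
    """Rewrite plain-HTTP DOI resolver links to HTTPS, leaving other URLs alone."""
    return HTTP_DOI_LINK.sub(r"https://\1", wikitext)
```

The hard part is not this substitution but finding every place a DOI is cited, and doing the edit across a very large number of articles.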

After some research we found that there are more ways than we expected to cite DOIs.

First, there’s the URL. You can see it in action in this article. URLs can take various forms.

Second there’s the official template tag, seen in action here:

<ref name="SCI-20140731">{{cite journal |title=Sustained miniaturization and anatomical innovation in the dinosaurian ancestors of birds |url=http://0-www-sciencemag-org.libus.csd.mu.edu/content/345/6196/562 |date=1 August 2014 |journal=[[Science (journal)|Science]] |volume=345 |issue=6196 |pages=562–566 |doi=10.1126/science.1252243 |accessdate=2 August 2014 |last1=Lee |first1=Michael S. Y. |first2=Andrea|last2=Cau |first3=Darren|last3=Naish|first4=Gareth J.|last4=Dyke}}</ref>

There’s a DOI in there somewhere. This is the best way to cite DOIs, firstly as it’s actually a proper traditional citation and there’s nothing magic about DOIs, secondly because it’s a template tag and can be re-rendered to look slightly different if needed.

Third there’s the old official DOI template tag that’s now discouraged:

<ref name="Example2006">{{Cite doi|10.1146/annurev.earth.33.092203.122621}}</ref>

And then there’s another one.

{{doi|10.5555/123456789}}

Knowing all this helps us find DOIs. But if we want to convert DOI links in Wikipedia to use HTTPS, it means that there are more template tags to modify and more pages to re-render.
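The four forms above could be recognised with something like the following. This is a sketch under assumptions: the patterns, the form labels and the resolver hostnames are ours, they only cover the spellings shown in this post, and real wikitext has many more variations.

```python
import re

# One illustrative pattern per citation style described above.
CITATION_FORMS = {
    "url": re.compile(r"https?://(?:dx\.)?doi\.org/(10\.[^\s|\]}<]+)"),
    "cite_param": re.compile(r"\|\s*doi\s*=\s*(10\.[^\s|}]+)"),
    "cite_doi": re.compile(r"\{\{\s*[Cc]ite doi\s*\|\s*(10\.[^\s|}]+)"),
    "doi_template": re.compile(r"\{\{\s*doi\s*\|\s*(10\.[^\s|}]+)"),
}


def find_doi_citations(wikitext):
    """Return sorted (form, doi) pairs for each DOI citation found in the wikitext."""
    found = set()
    for form, pattern in CITATION_FORMS.items():
        for doi in pattern.findall(wikitext):
            found.add((form, doi))
    return sorted(found)
```

Tagging each DOI with the form it was found in matters for the HTTPS conversion: a URL has to be edited in the article, whereas a template can be changed once and re-rendered.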

Nemo also put DOIs on the Interwiki Map which should make automatically changing some of the URLs a lot easier.

We’re very grateful to Nemo for his suggestions and work on this. We’ll report back!

The elephant in the room

Those of you who know how DOIs work will have spotted an unsecured elephant in the room. When you visit a DOI, you visit the URL, which hits the DOI resolver proxy server, which returns a message to your browser to redirect to the landing page on the publisher’s site.

Securely talking to the DOI resolver by using HTTPS instead of HTTP means that no-one can eavesdrop and see which DOI you are visiting, or tamper with the result and send you off to a different page. But the page you are sent to will be, in nearly all cases, still HTTP. Upgrading infrastructure isn’t trivial, and, with over 4000 members (mostly publishers), most Crossref DOIs will still redirect to standard HTTP pages for the foreseeable future.
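Put another way, a DOI visit is a chain of hops and each hop has its own scheme. A tiny sketch that flags the insecure ones, given a list of URLs you have already collected (the function name and the example hostnames are hypothetical):

```python
from urllib.parse import urlsplit


def insecure_hops(redirect_chain):
    """Return the URLs in a redirect chain that are not served over HTTPS."""
    return [url for url in redirect_chain if urlsplit(url).scheme != "https"]
```

For a chain such as an HTTPS resolver link followed by an HTTP landing page, only the landing page is returned: the lookup was private, the destination was not.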

You can keep as secure as possible by using HTTPS Everywhere.

Fin

There’s lots going on, so watch this space for developments. Thanks for reading this, and all the links. We’d love to know what you think.

Bootnote

Not long after this blog post was published we saw something very interesting.

[Image: Interesting DOI]

That’s no DOI. We like interesting things, but they can panic us. This turned out to be a great example of why this kind of thing can be useful. A minute’s digging and we found the article edit:

[Image: Wikipedia typo]

It turns out that this was a typo: someone put a title when they should have put in a DOI. And, as the event shows, this was removed from the Wikipedia article.


Page owner: Joe Wass   |   Last updated 2015-March-03