Drawing on the Research Nexus with Policy documents: Overton's use of the Crossref API
Update 2024-07-01: This post is based on an interview with Euan Adie, founder and director of Overton.
What is Overton?
Overton is a big database of government policy documents, also drawing on sources like intergovernmental organizations, think tanks, and big NGOs: in general, anyone who's trying to influence a government policymaker. What we're interested in is basically taking all the good parts of the scholarly record and applying some of that to the policy world. By this we mean finding all the documents that are out there, collecting metadata for them consistently, fitting it to our schema, extracting references from all the policy documents we find, and adding links between them; then we also do citation analysis.
What do you mean by the good parts of the scholarly record?
What I mean by the good parts of the scholarly record is, from a data perspective, having persistent, open metadata for items on different stable, interoperable platforms, and being able to build up layers of data to suit specific use cases. That's a better approach than trying to do everything in a silo here and a silo there, bit by bit, in a hundred different ways.
There's also a bad part, which is less to do with metadata and more to do with citation analysis and responsible metrics. With all this data… as the famous Spiderman quote goes… with great power comes great responsibility: once you start systematically collecting this data, it's very easy to fall into the trap of thinking that because we can put numbers on something, we can start reading meaning into those numbers, and then it spirals out of control. So the idea for Overton was: can we take that system, some of that infrastructure, and apply those ideas, but come at it already knowing where the pitfalls are and try to avoid them?
What is your main use of Crossref resources?
We rely heavily on Crossref to link policy documents to the scholarly record. The question we're trying to answer is: does this government document cite academic work? We work a lot with universities, think tanks, and IGOs. They're asking: where is the research we produce ending up? Is it being used by government? In some countries, like the UK, there's a big impact agenda, where it's quite important to demonstrate that for government funding. In the US as well, state universities, for example, aim to impact the local policy environment. Right? Are we producing things that went on to change life for local residents for the better? That's really what we're trying to support, and it's one of the main use cases of the database.
Can you tell us a little bit more about the story of Overton, how did this idea start?
It really came from two things. The first is that I'd always been interested in this area: before Overton, I founded a company called Altmetric.com, which looked at broader impact metrics for papers. We looked at Twitter, news, blogs, and other things, including policy. But policy wasn't a primary focus.
When I left Altmetric, two things were happening in the UK. Not that everything is about Brexit, but Brexit was happening, and then COVID happened as well. In both cases, it drove home to me that other people seemed to be very interested in the evidence the government had used to make decisions, be they good decisions, like some of the evidence-based initiatives during COVID, or bad decisions, like Brexit. So, how can you find out what that evidence was? It's actually very difficult to do: you can't really trace back how a decision was made. I thought there was a growing need for that kind of impact analysis. So the second thing was: can we do something that makes it easy to see what evidence goes into policy? The scholarly evidence, but also the other kinds of policy influence that go into any document or discussion.
What are the main challenges that you face when you are trying to retrieve these policy documents?
Well, the first is another thing that the scholarly record does well, which is persistence. We have CLOCKSS and all the dark archives. The whole idea is that if you have a DOI and something moves, it can be tracked and keeps its ID, and even if the publisher goes bust it will never disappear. If you cite it, there's always going to be a copy of it available somewhere, even if that's in a library or a dark archive.
One of the biggest challenges with policy documents is that that kind of persistence doesn't exist… There are a lot of statistics about link rot, and they hold true for policy documents as much as anywhere else. Every year a percentage of links break because websites are redesigned or a government changes. With policy it's even worse, because it can be by design. Think about it: a new government comes into power and, say, merges the Department of Agriculture with the Department of Fisheries into a completely new third thing. The other two departments disappear, or their sites start redirecting elsewhere.
One of the challenges is just keeping track of all the changes in the landscape and constantly trying to stay on top of the data; that's a big part of what we do. Another challenge, compared to journals, is that when you cite something in a scholarly document, you cite it in a given style, but there are no standards for referencing styles in policy documents. Even in the same document we can see four or five different ways of referring to something, sometimes missing important data and sometimes not. It means that when we're matching references using Crossref search, we usually have much more unparsable text to work with, as the sketch below illustrates.
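To make that concrete, here is a minimal Python sketch of matching a free-text citation against Crossref using the REST API's bibliographic query, which tolerates unstructured reference strings. This is an illustration, not Overton's actual pipeline: the score cutoff, contact address, and example citation are placeholders.

```python
import requests

def match_reference(citation_text, mailto="you@example.org"):
    """Try to match a free-text citation string against Crossref's works index."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={
            "query.bibliographic": citation_text,  # accepts unstructured reference text
            "rows": 1,
            "mailto": mailto,  # identifies the client to Crossref's "polite" pool
        },
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # The relevance-score cutoff below is an arbitrary placeholder,
    # not Overton's acceptance threshold.
    if items and items[0].get("score", 0) > 60:
        return items[0]["DOI"]
    return None
```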
How has your experience been so far using our Crossref API or our services in general?
It's been great. I would happily say this anywhere: I always talk about the Crossref API as one of the best examples of well-done scholarly infrastructure. It's well-documented. It's fast. It's clear. The rate limits are clear. It's up when it should be up. I like that you can trust it. So the technical aspect is great. From an organizational perspective, in contrast with a lot of infrastructure in the scholarly world where you don't know if it's even going to be there down the line, Crossref is pretty stable.
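One small illustration of that clarity around rate limits: the API has advertised its current limits in response headers, so a client can read them and pace itself accordingly. A quick check (the contact address is a placeholder) might look like:

```python
import requests

# Crossref exposes its current rate limits in response headers, so a
# well-behaved client can inspect them and throttle its own requests.
r = requests.get(
    "https://api.crossref.org/works",
    params={"rows": 0, "mailto": "you@example.org"},  # placeholder contact address
    timeout=30,
)
print(r.headers.get("x-rate-limit-limit"), r.headers.get("x-rate-limit-interval"))
```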
What would you say are the main challenges or things that we can improve in the future? What other expectations or suggestions do you have?
It depends on whether we're talking about how the service could be improved versus how the data could be improved. Data-wise, and I appreciate this is a publisher problem, not a Crossref one, we still have to pull other data from OpenAlex, for example, for things like affiliations, just because they're missing from so many articles, and equally things like ORCIDs for authors. And, in fact, disambiguation in general. It's a huge problem that either goes unsolved or you end up with a hundred different author disambiguation systems. I don't know if it's necessarily something Crossref wants to get into, but there's definitely nothing out there that's generally accepted already.
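For context, the kind of supplementary lookup Euan describes can be sketched against the OpenAlex API, which resolves works by DOI and attaches resolved institutions to each authorship. The helper name and contact address below are illustrative:

```python
import requests

def affiliations_for_doi(doi, mailto="you@example.org"):
    """Pull author affiliations for a DOI from OpenAlex.

    OpenAlex exposes works under a doi: namespace; each entry in the
    "authorships" list carries the institutions resolved for that author.
    """
    resp = requests.get(
        f"https://api.openalex.org/works/doi:{doi}",
        params={"mailto": mailto},  # OpenAlex's polite pool, like Crossref's
        timeout=30,
    )
    resp.raise_for_status()
    return [
        inst["display_name"]
        for authorship in resp.json().get("authorships", [])
        for inst in authorship.get("institutions", [])
    ]
```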
Another kind of improvement I'd like to see is making sure that changes in one API are reflected in the other, so they don't get out of sync. When somebody updates their ORCID record, I'd like that reflected in the Crossref record, if we're using that as the "canonical" metadata record for the DOI: retrospectively enriching records, essentially.
I think it's harder than I expected to just find preprints, because you can't simply use the item type. But I understand that this is maybe a bigger issue, so perhaps not one for the short term.
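To show the friction being described: Crossref registers preprints under the broader "posted-content" type with a "subtype" field on each record, rather than as a dedicated top-level type. One workaround, sketched below with a placeholder contact address, is to filter on posted-content and inspect the subtype client-side:

```python
import requests

# Filter for posted-content, then keep only items whose subtype marks
# them as preprints, since there is no top-level preprint type to filter on.
resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "filter": "type:posted-content",
        "rows": 20,
        "mailto": "you@example.org",  # placeholder contact address
    },
    timeout=30,
)
for item in resp.json()["message"]["items"]:
    if item.get("subtype") == "preprint":
        print(item["DOI"], (item.get("title") or [""])[0])
```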
Finally, and this is very specific, but we experienced friction when going from the snapshots to having something useful, whether in Elasticsearch or in Postgres. It would be nice to have some open-source scripts to download and process everything, convert it to relational tables, or send it to an Elasticsearch cluster.
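A starting point for that processing step might look like the sketch below. It assumes the public data file's layout of many gzipped JSON files, each containing an "items" array of work records; the directory path is hypothetical, and you should verify the layout of the snapshot you actually downloaded.

```python
import gzip
import json
from pathlib import Path

def iter_snapshot_works(snapshot_dir):
    """Stream work records out of a Crossref public data file directory.

    Assumes a layout of many .json.gz files, each a JSON object with an
    "items" array of works; check this against your downloaded snapshot.
    """
    for path in sorted(Path(snapshot_dir).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for work in json.load(fh)["items"]:
                yield work

# Flatten each work into a row shape suitable for a relational table;
# in practice you would batch-insert into Postgres or bulk-index into
# Elasticsearch rather than print.
for work in iter_snapshot_works("./crossref-snapshot"):  # hypothetical path
    print(work["DOI"], work.get("type"), (work.get("title") or [None])[0])
    break
```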