Text and data mining for members

Even if you already have your own API, the Crossref REST API provides additional benefits: it is a common, standards-based API that works across members. Requiring researchers to learn a different API for each member does not scale for TDM projects.

It is up to you to decide which formats to offer your full-text in: some members offer PDF, others XML, and some plain text. Some members vary what they deliver depending on the age of the content or other variables. The Crossref REST API does not provide automatic access to subscription content; access to subscription content is managed on your site using your existing access control systems.

As a member, you need to do two things to participate in the Crossref text and data mining service:

  1. Deposit the link to full-text in the metadata for each DOI so researchers can follow it to access your content
  2. Deposit a license URL in the metadata for each DOI so researchers can use this to find out if they have permission to carry out TDM with your content item

Add this information to Crossref using a resource-only deposit or by uploading a .csv file containing the URLs and the related DOIs.
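Once both URLs are deposited, a researcher's tool can pick them out of the metadata record for a DOI. The sketch below is illustrative only: it assumes the record exposes the deposited URLs as `link` and `license` arrays, with text-mining links marked by an `intended-application` value of `text-mining`. The sample record, DOI, and URLs are placeholders rather than real data.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, trimmed-down metadata record for one DOI.
# The DOI and URLs are placeholders, not real deposits.
SAMPLE_WORK: Dict[str, Any] = {
    "DOI": "10.5555/example",
    "link": [
        {
            "URL": "https://publisher.example/fulltext/example.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/fulltext/example.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
    ],
    "license": [
        {"URL": "https://publisher.example/licenses/tdm-1.0"},
    ],
}


def tdm_links(work: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (url, content_type) pairs for links flagged for text mining."""
    return [
        (link["URL"], link.get("content-type", "unspecified"))
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def license_urls(work: Dict[str, Any]) -> List[str]:
    """Return every license URL deposited for the item."""
    return [lic["URL"] for lic in work.get("license", [])]


if __name__ == "__main__":
    for url, content_type in tdm_links(SAMPLE_WORK):
        print(content_type, url)
    print("licenses:", license_urls(SAMPLE_WORK))
```

A record with neither URL deposited yields two empty lists, which is how a tool would detect that an item is not set up for TDM.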

If you are concerned about the impact of automated TDM harvesters on your site performance, you may choose to implement rate-limiting headers.

Rate limiting

TDM may change the volume of traffic that your servers have to handle when researchers download large numbers of files in bulk. You can mitigate performance issues with rate limiting.

We have defined a set of standard HTTP headers that servers can use to convey rate-limiting information to automated text and data mining tools. Well-behaved TDM tools can look for these headers when they query member sites, and adjust their behavior so as not to affect the site’s performance. The headers allow a member to define a rate limit window (a time span such as a minute, an hour, or a day). The member can then specify:

| Header name | Example value | Explanation |
| --- | --- | --- |
| CR-TDM-Rate-Limit | 1500 | Maximum number of full-text downloads allowed in the defined rate limit window |
| CR-TDM-Rate-Limit-Remaining | 76 | Number of downloads left in the current rate limit window |
| CR-TDM-Rate-Limit-Reset | 1378072800 | Time (in UTC epoch seconds) at which the rate limit resets and a new rate limit window starts |
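To illustrate how a well-behaved TDM tool might use these headers, here is a minimal sketch of the client side. It assumes that `CR-TDM-Rate-Limit-Reset` carries an absolute UTC epoch timestamp (which is what the example value suggests); the function name `seconds_to_wait` is hypothetical and not part of any Crossref library.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long a polite client should pause before its next download."""
    now = time.time() if now is None else now
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    remaining = lowered.get("cr-tdm-rate-limit-remaining")
    reset = lowered.get("cr-tdm-rate-limit-reset")
    if remaining is None or reset is None:
        return 0.0  # the site does not advertise a rate limit
    try:
        remaining_count = int(remaining)
        reset_at = float(reset)  # assumed: absolute UTC epoch seconds
    except ValueError:
        return 0.0  # malformed headers: nothing to act on
    if remaining_count > 0:
        return 0.0  # downloads are still available in this window
    # Window exhausted: wait until it resets (never a negative duration).
    return max(0.0, reset_at - now)
```

A harvester would call this on the headers of each response and sleep for the returned number of seconds before requesting the next file.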

Crossref does not provide or enforce this rate limiting - it’s up to you to implement it if required, and to define a rate limit appropriate for your servers.
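One way a member might implement this on the server side is a fixed-window counter per client. The sketch below is an assumption-laden illustration, not a recommended production design: the class name `FixedWindowLimiter`, the in-memory dictionary, the one-hour default window, and the choice to report the reset as an absolute epoch timestamp are all hypothetical choices made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class FixedWindowLimiter:
    limit: int = 1500           # downloads allowed per window (example value)
    window_seconds: int = 3600  # assumed window length: one hour
    # client id -> (start of the window last seen, downloads used in it)
    _state: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def check(self, client_id: str, now: Optional[float] = None) -> Tuple[bool, Dict[str, str]]:
        """Count one download attempt; return (allowed, headers to send)."""
        current = int(time.time() if now is None else now)
        window_start = current - (current % self.window_seconds)
        start, used = self._state.get(client_id, (window_start, 0))
        if start != window_start:
            used = 0  # a new window has begun, so the counter starts over
        allowed = used < self.limit
        if allowed:
            used += 1
        self._state[client_id] = (window_start, used)
        headers = {
            "CR-TDM-Rate-Limit": str(self.limit),
            "CR-TDM-Rate-Limit-Remaining": str(self.limit - used),
            "CR-TDM-Rate-Limit-Reset": str(window_start + self.window_seconds),
        }
        return allowed, headers
```

When `allowed` is false, the server would decline the download (for example with an HTTP 429 response) while still sending the three headers, so the client can see when the window resets.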

Example member site

We have created TinyPub to show an implementation of the Crossref REST API, including rate limiting and IP-based subscription access. You can download this code for reference, but please note that it’s just to illustrate the workings of the system, and is not intended for production.

Page owner: Laura J. Wilkinson   |   Last updated 2020-April-08