Start citing data now. Not later

Geoffrey Bilder – 2023 March 23

In Metadata, Citation, Data Citation, Research Nexus

Recording data citations supports data reuse and aids research integrity and reproducibility. Crossref makes it easy for our members to submit data citations to support the scholarly record.

TL;DR

Citations are essential/core metadata that all members should submit for all articles, conference proceedings, preprints, and books. Submitting data citations to Crossref has long been possible. And it’s easy. You just need to:

Include data citations in the references section as you would for any other citation
Include a DOI or other persistent identifier for the data if it is available - just as you would for any other citation
Submit the references to Crossref through the content registration process as you would for any other record
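
As a rough illustration of the three steps above, here is a minimal Python sketch that builds a reference list in which a data citation sits alongside an ordinary one. It assumes the `citation_list`, `citation`, `doi`, and `unstructured_citation` element names used in Crossref deposit XML, so check them against the schema version you deposit with. The `build_citation_list` helper and the reference text are hypothetical. The Dryad DOI is the one cited later in this post.

```python
import xml.etree.ElementTree as ET


def build_citation_list(references):
    """Build a <citation_list> element from simple reference dicts.

    Hypothetical helper: each dict needs a "key"; "doi" and "text" are optional.
    Element names are assumed to match the Crossref deposit schema.
    """
    citation_list = ET.Element("citation_list")
    for ref in references:
        citation = ET.SubElement(citation_list, "citation", key=ref["key"])
        # Include a DOI when one is available, exactly as for any other citation.
        if ref.get("doi"):
            ET.SubElement(citation, "doi").text = ref["doi"]
        # Free-text form of the reference as it appears in the references section.
        if ref.get("text"):
            ET.SubElement(citation, "unstructured_citation").text = ref["text"]
    return citation_list


# Placeholder reference text; only the Dryad DOI comes from the post itself.
references = [
    {"key": "ref1", "text": "Placeholder text for an ordinary article reference."},
    {"key": "ref2", "doi": "10.5061/dryad.854j2",
     "text": "Placeholder text for a dataset reference (Dryad)."},
]

print(ET.tostring(build_citation_list(references), encoding="unicode"))
```

The data citation gets no special treatment here: it is one more `citation` entry, which is the point of the steps above.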

And your data citations will flow through all the normal processes that Crossref applies to citations. They will be distributed openly to the community (including DataCite!) via Crossref’s services and APIs. All data citations deposited with Crossref will be exposed in the (soon-to-be-launched) Data Citation Corpus.

And then, you can sit back and congratulate yourself for making your publication more useful to researchers who want to be able to reuse the data underlying your publications.

Background

You might ask, “So if submitting data citations to Crossref has long been possible, why do you have to write this?”

Historically, authors did not cite data in the way they cited publications. Instead, they would often refer to the data in the main text of the article. This has made it hard to determine what data lay behind the research, or to access that data.

But the research community has increasingly recognized that data is a first-class research output and that we should treat it as such. In short, we should formally cite data.

But because citing data is a comparatively new practice, it has been subject to a lot of new analysis. And unsurprisingly, people analyzing data citation have discovered that there is a lot of nuance to citation of any kind.

There are lots of reasons for citing something. There are lots of internalized conventions for citing things. And there are different conventions for citation for different research objects. And citation practice in the social sciences and humanities (SSH) differs from STEM. And legal citation practices are different from scholarly citation practices. And citation practices even vary by subdiscipline and by journal.

Those who have been looking at what it means to “cite data” have naturally stumbled into a thicket of divergent practices - some of which are historical holdovers, some of which are stylistic preferences, and some of which are clearly adaptations to deal with the specific needs of certain research objects/containers or different disciplines.

The temptation has been to try and rationalize this before extending the practice of citation to data.

“Maybe because data is a distinct record type, we should include the fact that it is a data citation in the citation itself?”

“Maybe because people cite data for different reasons, we should include a typology of citation types in all data citations?”

And so you may hear some people say, “hold off on data citation - we don’t have an optimal way to do it yet, and it can be very complicated.”

But guess what?

We currently don’t label citations to monographs as “citation to monograph.”

And we don’t currently include the reason for citation when we are citing a journal article.

It would be very cool if we did. And it would likely make citations even more useful if we did.

But citations are already useful even without these features. And so, to delay citing data indefinitely because we have an opportunity to improve the act of citation is just perverse. Our community has always opted for progress over perfection.

For one thing - the efforts are not mutually exclusive. We can start citing data with the current limitations of citation practices and simultaneously propose mechanisms for making citation more useful in the future, including new guidelines to deal with the unique issues that citing data poses.

But in the meantime, we will be doing researchers a giant favour if we at least include our imperfect, ambiguous, and unconventional references to data in the references section of an article so that they can be accessed and processed along with all the other imperfect, ambiguous, and variant citations that we find so useful.

Some of our members are already doing this. They have been for a long time. And they haven’t found it any more complicated than managing non-data references in the past.

Join them and make your metadata more useful.

Cite data now. Don’t put it off.

And Crossref will continue to work with DataCite and the rest of the community to make the distribution even easier and more useful.

So who is already citing data?

Top 10 members depositing data citations from November-May 2022

(broken down by DOI prefix, which is why you see some publishers listed twice):

Prefix	Member name	Data citations deposited
10.1038	Springer Science and Business Media LLC	7174
10.1016	Elsevier BV	6527
10.1007	Springer Science and Business Media LLC	4748
10.5194	Copernicus GmbH	3017
10.1080	Informa UK Limited	2346
10.1177	SAGE Publications	2082
10.1002	Wiley	2048
10.1111	Wiley	1888
10.1108	Emerald	1876
10.3390	MDPI AG	1827

Top 10 data citations per deposited work

(again, broken down by prefix)

Member name	Prefix	Data citations deposited	Data citations per work
Consortium Erudit	10.7202	580	1.149
SLACK, Inc.	10.3928	462	0.646
S. Karger AG	10.1159	1653	0.532
Proceedings of the National Academy of Sciences	10.1073	973	0.502
American Academy of Pediatrics (AAP)	10.1542	486	0.397
F1000 Research Ltd	10.12688	552	0.341
American Association for the Advancement of Science (AAAS)	10.1126	952	0.317
Springer Science and Business Media LLC	10.1038	7174	0.231
JMIR Publications Inc.	10.2196	864	0.187
American Geophysical Union (AGU)	10.1029	692	0.166

These figures cover only the prefixes with the most data citations deposited (>500 in 6 months), so there might be smaller members doing better than this.

Summaries are great, but I want to see some actual examples!

Here are some examples showing how data is cited by our members:

This eLife article: https://0-doi-org.libus.csd.mu.edu/10.7554/eLife.26410 cites this dataset in Dryad https://0-doi-org.libus.csd.mu.edu/10.5061/dryad.854j2.
This Copernicus article: https://0-doi-org.libus.csd.mu.edu/10.5194/acp-22-7105-2022 cites this dataset https://0-doi-org.libus.csd.mu.edu/10.24381/cds.bd0915c6.
This Sciendo article: https://0-doi-org.libus.csd.mu.edu/10.2478/plc-2021-0008 cites this APA-hosted language competence test https://0-doi-org.libus.csd.mu.edu/10.1037/t15159-000
This De Gruyter article: https://0-doi-org.libus.csd.mu.edu/10.1515/opth-2020-0160 cites this bibliography at Oxford Bibliographies: https://0-doi-org.libus.csd.mu.edu/10.1093/OBO/9780195396584-0012

And here are some example API requests for discovering more metadata citations. You can use these API requests as examples and adapt them to your own needs.
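
As one hedged illustration of adapting such a request, the Python sketch below assumes the public Crossref REST API route `https://api.crossref.org/works/{DOI}`. It also assumes that a work's deposited references appear under `message.reference`, each with an optional `DOI` field. The helper names (`work_url`, `cited_dois`, `fetch_work`) and the sample record are hypothetical. The two DOIs are the eLife article and Dryad dataset from the examples above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the public Crossref REST API "works" route.
API_BASE = "https://api.crossref.org/works/"


def work_url(doi):
    """Return the API URL for a single work, percent-encoding unusual characters."""
    return API_BASE + quote(doi, safe="/")


def cited_dois(work_message):
    """List the DOIs found in a work's deposited references.

    Assumes references live under "reference" and carry an optional "DOI" key.
    """
    return [ref["DOI"] for ref in work_message.get("reference", []) if "DOI" in ref]


def fetch_work(doi):
    """Fetch one work record over the network (not called in this sketch)."""
    with urlopen(work_url(doi)) as response:
        return json.load(response)["message"]


# Hypothetical, trimmed-down record shaped like the assumed API response.
sample_message = {
    "DOI": "10.7554/eLife.26410",
    "reference": [
        {"key": "bib1", "unstructured": "A reference deposited without a DOI."},
        {"key": "bib2", "DOI": "10.5061/dryad.854j2"},
    ],
}

print(work_url("10.7554/eLife.26410"))
print(cited_dois(sample_message))
```

To run it against live data, pass the result of `fetch_work("10.7554/eLife.26410")` to `cited_dois`. Deciding which of the returned DOIs point to datasets is a separate step that this sketch does not attempt.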
