What’s in the metadata matters because it is So.Heavily.Used.
You might be tired of hearing me say it but that doesn’t make it any less true. Our open APIs now see over 1 billion queries per month. The metadata is ingested, displayed and redistributed by a vast, global array of systems and services that in whole or in part are often designed to point users to relevant content. It’s also heavily used by researchers, who author the content that is described in the metadata they analyze. It’s an interconnected supply chain of users large and small, occasional and entirely reliant on regular querying.
Tl;dr
Crossref recently wrapped up our first Working Group for users of the metadata, a group that plays a key role in discoverability and the metadata supply chain. You can jump directly to the stakeholder-specific recommendations or take a moment to share your use case or feedback.
Why a metadata user group? Why now?
A majority of Crossref metadata users rely on our free, open APIs, and many are anonymous. A small but growing group of users pays for a guaranteed service-level option. While their individual needs and feedback have long been integrated into Crossref’s work, as a group they provide a window into the workflows and use cases for the metadata of the scholarly record. As this use grows in strategic importance, to both Crossref and the wider community, it was clear that we might be overdue for a deeper dive into user workflows.
In 2021, we surveyed these subscribers for their feedback and brought together a few volunteers over a series of 5 calls to dig into a number of topics specific to regular users of metadata. This group, the first primarily non-member working group at Crossref, wrapped up in December 2022, and we are grateful for their time:
Achraf Azhar, Centre pour la Communication Scientifique Directe (CCSD)
Satam Choudhury, HighWire Press
Nees Jan van Eck, CWTS-Leiden University
Bethany Harris, Jisc
Ajay Kumar, Nova Techset
David Levy, Pubmill
Bruno Ohana, biologit
Michael Parkin, European Bioinformatics Institute (EMBL-EBI)
Axton Pitt, Litmaps
Dave Schott, Copyright Clearance Center (CCC)
Stephan Stahlschmidt, German Centre for Higher Education Research and Science Studies (DZHW)
This post is intended to summarize the work we did, to highlight the role of metadata users in research communications, to provide a few ideas for future efforts and, crucially, to get your feedback on the findings and recommendations. Though this particular group set out to meet for a limited time, we hope this report helps facilitate ongoing conversations with the user community.
Survey Highlights
If you’re looking for an easy overview of users and use cases, here’s a great starting point.
If you interpret this graphic to mean that there is a lot of variety centered on a few high-level use cases, the survey and our experiences with users certainly support that. A few key takeaways from the 2021 survey may be useful context:
Frequency of use: At least 60% of respondents query metadata on a daily basis
Use cases
Finding and enhancing metadata as well as using it for general discovery are all common use cases
For most users, matching DOIs and citations is a common need, but for a significant group it is their primary use case (a minimal matching sketch follows this list)
Analyzing the corpus for research was a consistent use case for 13% of respondents
Metadata of particular interest
Abstracts are the most desirable non-bibliographic metadata, followed by affiliation information, including RORs
Respondents also find a number of other elements useful, beyond citation information.
NB: The survey did not ask about references but we are frequently asked why they’re not included more often.
It’s also worth noting that about a third of respondents said that correct metadata is more important to them than any particular element.
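To make the matching use case a little more concrete, here is a minimal Python sketch that sends a free-text citation to the REST API works endpoint via the query.bibliographic parameter and takes the top result if its relevance score clears a threshold. The score cut-off, the mailto address, and the example citation string are illustrative assumptions, not Crossref recommendations; production matching usually needs more careful validation.

```python
import requests

def match_citation(citation_text, mailto="you@example.org"):
    """Return the DOI of the best API match for a free-text citation, or None.

    The score threshold below is an illustrative assumption, not an official rule.
    """
    resp = requests.get(
        "https://api.crossref.org/works",
        params={
            "query.bibliographic": citation_text,  # relevance search over bibliographic fields
            "rows": 1,                             # only the top candidate is needed
            "select": "DOI,title,score",           # keep the payload small
            "mailto": mailto,                      # identifies the caller for the "polite" pool
        },
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if items and items[0].get("score", 0) >= 60:   # illustrative cut-off
        return items[0]["DOI"]
    return None

# Example (hypothetical citation string):
# match_citation("Smith J. (2019) An example article. Journal of Examples 12:34-56")
```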
There is more to this survey than is covered here, but we kept it fairly short to help with the response rate. Knowing we would have some focused time to discuss issues too numerous or nuanced to reasonably address in a survey, we compiled a long list of questions and topics for the Working Group, then followed up with a second, more detailed survey to kick off the meeting series.
What we set out to address
We had three primary goals for this Working Group:
Highlight the efforts of metadata users in enabling discovery and discoverability
Of course, everyone involved had some questions and topics of interest to cover, including (but not limited to):
Understanding publisher workflows
How best to introduce changes, e.g. for a high volume of updated records
Understanding the Crossref schema
Query efficiencies, i.e. ‘tips and tricks’ for the REST API (a small example follows this list)
Which scripts, tools and/or programs are used in workflows
What other metadata sources are used
What kind of normalization or processing is done on ingest
How metadata errors are handled
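On the query-efficiency point above, here is a small sketch of the kind of pattern we often suggest: narrow the result set server-side with filter, trim each record with select, and walk large result sets with cursor-based deep paging rather than offsets. The specific filter values and the mailto address are placeholders.

```python
import requests

BASE = "https://api.crossref.org/works"

def iter_filtered_works(filters, fields, mailto="you@example.org"):
    """Yield work records matching `filters`, using cursor-based deep paging.

    Example: iter_filtered_works(
        "type:journal-article,from-pub-date:2022-01-01",
        "DOI,title,container-title",
    )
    """
    cursor = "*"                               # initial deep-paging token
    while True:
        resp = requests.get(
            BASE,
            params={
                "filter": filters,             # narrow the result set server-side
                "select": fields,              # trim each record to what you need
                "rows": 1000,                  # maximum page size
                "cursor": cursor,
                "mailto": mailto,              # "polite" pool identification
            },
            timeout=60,
        )
        resp.raise_for_status()
        message = resp.json()["message"]
        if not message["items"]:
            break
        yield from message["items"]
        cursor = message["next-cursor"]        # token for the next page
```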
What did we learn?
Workflows
I started with the admittedly ambitious goal of collecting a library of workflows. After a few years of working with users, I learned never to assume what a user was doing with the metadata, why, or how. For example, some subscribers use Plus snapshots (a monthly set of all records) regularly or occasionally, and some don’t use them at all. Understanding why users make the choices they do is always helpful.
In my experience, workflows are frequently characterized as “set it and forget it.” It’s hard to know how often and how easily they might be adapted when, for example, a new record type like peer review reports becomes available. So, it’s worth exploring when and how to highlight to users changes that might be of interest.
As it turned out, half the group had their workflows mostly or fully documented. For the rest, workflows were partially documented, not documented at all, or their documentation status was unknown. Helping users document their workflows, to the extent possible, should be a mutually beneficial effort to explore going forward. We’re doing similar work with the aim of making ours more transparent and replicable.
Feedback on subscriber services
User feedback might be the most obvious and directly consequential work of this group, at least for Crossref: understanding how well the services they use meet their needs and what might be improved.
One frequent suggestion for improvement is faster response times on queries. This is an area we’ve focused on for some time; refining queries to be more efficient is often the most straightforward way to improve response times, which is one reason for the emphasis on workflows.
We also discussed whether and how to notify users of changes of interest. Just defining “change” is complex, since changes are frequent and often very minor. We’ve been experimenting a bit over the past few years with notifying these users when we’re aware of upcoming large volumes of changes, for example when landing page URLs are updated due to a platform migration. It was incredibly useful to discuss with the group what volume of records would be a useful threshold to trigger a notification (100K, if you’re curious).
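To make the threshold discussion a bit more concrete, here is a rough sketch of how a user might gauge how many records under a given DOI prefix have been re-indexed recently, using the prefix works endpoint and the from-index-date filter. Treating re-indexing as a proxy for “change” is a simplification, and the 100K cut-off is simply the figure the group suggested, applied here for illustration.

```python
import requests
from datetime import date, timedelta

def recent_reindex_count(prefix, days=7, mailto="you@example.org"):
    """Count records under a DOI prefix re-indexed in the last `days` days.

    Returns the count and whether it crosses the illustrative 100K threshold.
    """
    since = (date.today() - timedelta(days=days)).isoformat()
    resp = requests.get(
        f"https://api.crossref.org/prefixes/{prefix}/works",
        params={
            "filter": f"from-index-date:{since}",  # re-indexed on/after this date
            "rows": 0,                             # only the total count is needed
            "mailto": mailto,
        },
        timeout=30,
    )
    resp.raise_for_status()
    total = resp.json()["message"]["total-results"]
    return total, total >= 100_000

# Example (hypothetical prefix):
# count, over_threshold = recent_reindex_count("10.5555")
```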
But perhaps the most common feedback we get from all users is on the metadata itself and the myriad quality issues involved. The group spent a fair amount of time discussing how this affects their work and shared a few examples of notable concerns (a small sanity-check sketch follows this list):
Author name issues, e.g. ‘Anonymous’ is a legitimate option for authors, but that and placeholders like ‘n/a’ sometimes appear in surname fields
Invalid DOIs are sometimes found in reference lists
Garbled characters from text not rendering properly
Affiliation information is often missing or incomplete (e.g. lacking ROR IDs)
Inconsistencies in commonly included information, e.g. ISSNs
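For users who want to catch some of these issues on ingest, a few are easy to screen for mechanically. The sketch below (Python, operating on a single REST API work record) flags placeholder surnames and syntactically malformed DOIs in reference lists; the placeholder list and the simplified DOI pattern are illustrative assumptions, not an official validation rule.

```python
import re

# Simplified pattern loosely based on the regular expression Crossref has
# suggested for matching modern DOIs; real validation is more nuanced.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)

# Illustrative set of placeholder values sometimes seen in surname fields.
PLACEHOLDERS = {"anonymous", "n/a", "na", "none", "unknown", "-"}

def flag_issues(work):
    """Flag a few of the issues listed above in a single REST API work record (a dict)."""
    problems = []
    for author in work.get("author", []):
        family = (author.get("family") or "").strip().lower()
        if family in PLACEHOLDERS:
            problems.append(f"placeholder surname: {author.get('family')!r}")
    for ref in work.get("reference", []):
        doi = ref.get("DOI")
        if doi and not DOI_PATTERN.match(doi):
            problems.append(f"malformed DOI in reference: {doi!r}")
    return problems
```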
It’s worth noting a common misunderstanding, and not just among users: what is actually required in the metadata. Users nearly always expect more metadata, and more consistency, than is actually available. The introduction of Participation Reports a few years ago was a very useful start to what is an ongoing discussion about the variable nature of metadata quality and completeness.
In my view, the role of metadata users in the discoverability of content is key, and one that often doesn’t get enough attention, especially given that the systems and services that ingest this metadata use it to point their own users to relevant resources. And because they work so closely with the metadata, users frequently report errors and so serve as a sort of de facto quality control. So, unfortunately, the effects of incomplete or incorrect metadata on these users might be the most powerful way to highlight the need for more and better metadata.
What are the recommendations?
In discussions with the Working Group, a few themes emerged, largely around best practices, which, by their nature, tend to be aspirational.
If you’re not already familiar with the personas and the Best Practices and Principles of Metadata 2020, they are a useful starting point (I am admittedly biased here!), and many of those principles are echoed in the following recommendations:
For users:
Document and periodically review workflows
Report errors to members or to Crossref support and reflect corrections when they’re made (metadata and content)
Define a set of metadata changes, e.g. to affiliations, to further the discussion around thresholds for notifying users of ‘high volumes’ of changes
For Crossref:
Provide an output schema.
Continue refining the input schema to include information like preprint server name, journal article sub types (research article, review article, letter, editorial, etc.), corresponding author flags, raw funding statement texts, provenance information, etc.
Collaborate on improving processes for reporting metadata errors and making corrections and enhancements
For metadata providers (publishers, funders and their service providers):
Consistency is important, e.g. using the same, correct relationship for preprint to VoR links for all records
Workarounds such as putting information into a field that is ‘close’ but not meant for it can be considered a kind of error
Understand the roles and needs of users in amplifying your outputs
Respond promptly to reports of metadata errors
Whenever possible, provide PIDs (ORCID iDs, ROR IDs, etc.) in addition to, not as a substitute for, textual metadata (see the sketch below)
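To illustrate why this matters downstream, here is a small sketch of how a metadata user might measure PID coverage in the records they ingest, counting contributors with ORCID iDs and with ROR-identified affiliations. The field shapes follow how these identifiers can appear in REST API output, but they are simplified for illustration and should be checked against live data.

```python
def pid_coverage(work):
    """Summarize ORCID and ROR coverage for the contributors of one work record.

    The field shapes assumed here (author["ORCID"], affiliation "id" entries
    with id-type "ROR") are simplified for illustration.
    """
    authors = work.get("author", [])
    with_orcid = sum(1 for a in authors if a.get("ORCID"))
    with_ror = 0
    for a in authors:
        affiliations = a.get("affiliation", [])
        if any(i.get("id-type") == "ROR"
               for aff in affiliations
               for i in aff.get("id", [])):
            with_ror += 1
    return {
        "authors": len(authors),
        "with_orcid": with_orcid,
        "with_ror_affiliation": with_ror,
    }
```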
What is still unclear or unfinished?
Honestly, a lot. We knew from the outset that the group would conclude with much more work still to be done: in part because there is so much variety under the umbrella of metadata users, and many answers lead to more questions, and in part because the metadata and the user community will continue to evolve. Even without a standing group that meets regularly, it’s very much an ongoing conversation, and we invite you to join it.
Now it’s your turn: can you help fill in the blanks?
Does any or all of this resonate with you? Do you take exception to any of it? Do you have suggestions for continuing the conversation?
Specifically, can you help fill in any of the literal blanks? We’ve prepared a short survey that we hope can serve as a template for collecting (anonymous) workflows. Please take a few minutes to answer some short questions, such as how often you query for metadata.
If you are willing to share examples of your queries or have questions or further comments, please get in touch.