Data Scientist - Crossref

Applications for this position will be closed on February 5, 2025.

Are you interested in using data to understand the scholarly landscape better and help the scholarly community? Would you like to help Crossref make better-informed decisions? Join us as a Data Scientist.

Location: Remote and global (to partially overlap with working hours in European timezones)
Type: Full-time
Remuneration: 120k USD or local equivalent. Note this is a general guide (as there is no universal currency) and local currency analysis will take place before the final offer.
Reports to: Director of Data Science, Dominika Tkaczyk
Timeline: Advertise in January/February and offer by February/March

About the role

Crossref operates an open infrastructure that connects thousands of scholarly publishers, millions of research articles, and research objects to serve an increasingly diverse set of communities within scholarly publishing, research, funding, and beyond. Our system acts as the backbone for preserving and sharing the scholarly record. We offer a wide array of services to ensure that scholarly research metadata is registered, linked, and distributed. When members register their content with us, we collect both bibliographic and non-bibliographic metadata. We process it so that connections can be made between publications, people, organizations, and other associated outputs. We preserve the metadata we receive as a critical part of the scholarly record. We also make it openly available across a range of interfaces and formats so that the community can use it and build tools with it.

Over the last few years, we have witnessed substantial growth in the scholarly community, which has been reflected in the increase in the volume and variety of the data we deal with. On the one hand, this growth opens new possibilities for using the data to better understand the scholarly landscape, serve the community more effectively, and make more informed, data-driven decisions. On the other hand, we are facing challenges related to the scale and complexity of the data. To fulfil our data-related ambitions and better address the challenges, Crossref has introduced a new Data Science team. The Data Science team will use scientific research and data science to deliver, assess, improve, and enrich scholarly metadata.

The Data Science team will provide in-house data expertise to the Programs and Technology teams relating to system improvements, community impact, metadata enrichment, and other key initiatives. We work in matrix program groups across three areas of focus: Co-create and reflect community trends; Contribute to the research nexus; and Open & sustainable operations. The Data Scientists will be embedded in program steering groups.

We are looking for two Data Scientists to join our Data Science team. The roles will have different focuses:

Data analysis & insights: The first role will be responsible for processing and analyzing the scholarly and operational data to help the scholarly community and inform Crossref’s strategy and decisions, as well as proposing new ideas for how we can use the data to fulfil our mission. This role will closely collaborate with all other Crossref teams to co-create ideas and transform them into new knowledge and working solutions.
Data availability & engineering: The second role will be responsible for detecting and assessing issues and gaps in the scholarly metadata, as well as researching strategies to increase the completeness and accuracy of the metadata and relationships, using internal and external data sources. This role will collaborate with the Technical and Program teams on transforming the research results into production-level services and workflows.

Key responsibilities

Data Analysis & Insights

Working with scholarly metadata and Crossref operational data to answer questions and gather evidence supporting or disproving hypotheses
Detecting, diagnosing and assessing problems and gaps in the scholarly metadata using automated and semi-automated techniques
Gathering insights from available data to help Crossref make well-informed strategic decisions
Analyzing trends and monitoring the results of various decisions and policies
Researching and proposing new data sources and research opportunities that help to support Crossref’s strategy
Evaluating and adopting appropriate data analysis tool(s) for the organisation to use for insights and reporting
Presenting the insights and new knowledge learned through data science activities internally and externally
Collaborating with all Crossref teams to understand their needs, co-create ideas and research questions, and propose data-driven approaches to address them
Collaborating with the data science and academic research community in the fields of bibliometrics, scientometrics, digital libraries, and similar
Engaging with members, users, and partner organisations to understand trends and needs, and contribute to others’ community initiatives and awareness
Implementing and promoting good practices around research, data management, data governance, and transparency

Data Availability & Engineering

Detecting, diagnosing and assessing problems and gaps in the scholarly metadata using automated and semi-automated techniques
Researching automatic and semi-automatic strategies to increase the completeness and accuracy of the metadata and relationships, for example, through data cleaning, metadata matching, metadata extraction from unstructured sources
Using evaluation techniques to estimate the quality of automated strategies
Proposing additional metadata sources, assessing the overlap between different databases and researching strategies for metadata merging
Collaborating with the Metadata team on modelling of the metadata gathered from multiple sources and inferred automatically, considering provenance information
Collaborating with the Technology and Program teams on transforming the research results into production-level services
Communicating the insights and new knowledge learned through data science activities internally and externally
Collaborating with the data science and academic research community in the fields of bibliometrics, scientometrics, digital libraries, and similar
Engaging with members, users, and partner organisations to understand trends and needs, and contribute to others’ community initiatives and awareness
Implementing and promoting good practices around research, data management, data governance, and transparency

About you

Essential experience and skills:

Minimum 3 years of hands-on experience in data science, data engineering, applied research, or similar
Proven track record of designing, running, and communicating data science experiments
Experience with using and developing data science-based tools and services
Experience with software and data engineering
Strong analytical and problem-solving skills
Expertise in Python programming language
Familiarity with machine learning concepts and methods
Familiarity with relational databases and REST APIs
Willingness to learn new skills and work with a variety of technologies
Ability to work independently in a self-directed way while consulting with others and collaborating openly
Ability to plan and manage projects, i.e., think ahead, outline goals, and organize steps to achieve the desired outcomes
Good communication skills with the ability to explain technical concepts to non-technical audiences
Awareness of the limitations of data, e.g., relating to cultural or geographic biases

Nice-to-have skills:

Experience with scholarly metadata
Experience with metadata modelling
Knowledge of the dynamics of research communications and relevant communities
Experience with integrating data from multiple sources
Familiarity with JSON and mixed-content model XML
Experience with natural language processing techniques
Experience with statistical inference and sampling
Experience with large-scale data processing frameworks such as Spark
Experience with AWS services
Experience with search engines such as Elasticsearch
Experience with deploying and maintaining machine learning solutions in production
Experience with data visualization tools

About Crossref & the team

We’re a nonprofit membership organization that exists to make scholarly communications better. We rally the community; tag and share metadata; run an open infrastructure; play with technology; and make tools and services—all to help put research in context.

We envision a rich and reusable open network of relationships connecting research organizations, people, things, and actions; a scholarly record that the global community can build on forever, for the benefit of society. We are working towards this vision of a ‘Research Nexus’ by demonstrating the value of richer and connected open metadata, incentivising people to meet best practices, while making it easier to do so. “We” means 20,000+ members from 160 countries, 160+ million records, and nearly 2 billion monthly metadata queries from thousands of tools across the research ecosystem. We want to be a sustainable source of complete, open, and global scholarly metadata and relationships.

Take a look at our strategic agenda to see the planned work that aims to achieve the vision. The sustainability area aims to make transparent all the processes and procedures we follow to run the operation long-term, including our financials and our ongoing commitment to the Principles of Open Scholarly Infrastructure (POSI). The governance area describes our board and its role in community oversight.

It also takes a strong team – because reliable infrastructure needs committed people who contribute to and realise the vision, and thrive doing it. We are a distributed group of 46 dedicated people who like to play quizzes, talk about celery (sometimes cucumber), measure coffee intake, and create hundreds of custom Slack emojis. We enthusiastically support the Oxford comma but waver between use of American or British English. Occasionally we do some work to improve knowledge sharing worldwide, which we take a bit more seriously than ourselves. We do this through fair policies and working practices, a balanced approach to resourcing, and accountability to each other.

We can offer the successful candidate a challenging and fun environment to work in. Together we are dedicated to our global mission and we are constantly adapting to ensure we get there. Take a look at our organisation chart, the latest Annual Meeting recordings, and our financial information.

Thinking of applying?

We especially encourage applications from people with backgrounds historically under-represented in research and scholarly communications. You can be based anywhere in the world where we can employ staff, either directly or through an employer of record.

Click here to apply!

Please strive to submit your application by February 5, 2025.

Anticipated salary for this role is approximately 120k USD-equivalent, paid in local currency. Crossref offers competitive compensation, benefits, flexible work arrangements, professional development opportunities, and a supportive work environment. As a nonprofit organization, we prioritize mission over profit.

Equal opportunities commitment

Crossref is committed to a policy of non-discrimination and equal opportunity for all employees and qualified applicants for employment without regard to race, colour, religion, sex, pregnancy or a condition related to pregnancy, sexual orientation, gender identity or expression, national origin, ancestry, age, physical or mental disability, genetic information, veteran status, uniform service member status, or any other protected class under applicable law. Crossref will make reasonable accommodations for qualified individuals with known disabilities in accordance with applicable law.
