Dominika Tkaczyk – 2018 December 18
In my previous blog post, Matchmaker, matchmaker, make me a match, I compared four approaches for reference matching. The comparison was done using a dataset composed of automatically-generated reference strings. Now it’s time for the matching algorithms to face the real enemy: the unstructured reference strings deposited with Crossref by some members. Are the matching algorithms ready for this challenge? Which algorithm will prove worthy of becoming the guardian of the mighty citation network? Buckle up and enjoy our second matching battle!
In reference matching, we try to find the DOI of the document referenced by a given input reference. The input reference can have a structured form (a collection of metadata fields) and/or an unstructured form (a string formatted in a certain citation style).
In my previous blog post, I used reference strings generated automatically to compare four matching algorithms: Crossref’s legacy approach based on reference parsing and three variations of search-based matching. The best algorithm turned out to be Search-Based Matching with Validation (SBMV). SBMV uses our REST API’s bibliographic search function to select the candidate target documents, and a separate validation-scoring procedure to choose the final target document. The legacy approach and SBMV achieved very similar average precision, and SBMV was much better in average recall.
This comparison had important limitations, which affect the interpretation of these results.
First of all, the reference strings in the dataset might be too perfect. Since they were generated automatically from the Crossref metadata records, any piece of information present in the string, such as the title or the name of the author, will exactly match the information in Crossref’s metadata. In such a case, a matcher comparing the string against the record can simply apply exact matching and everything should be fine.
In real life, however, we should expect all sorts of errors and noise in the reference strings. For example, a string might have been manually typed by a human, so it can have typos. The string might have been scraped from the PDF file, in which case it could have unusual unicode characters, ligatures or missing and extra spaces. A string can also have typical OCR errors, if it was extracted from a scan.
These problems are typical for messy real-life data, and our matching algorithms should be robust enough to handle them. However, when we evaluate and compare approaches using the perfect reference strings, the results won’t tell us how well the algorithms handle harder, noisy cases. After all, even if you repeatedly win chess games against your father, it doesn’t mean you will likely defeat Garry Kasparov (unless, of course, you are Garry Kasparov’s child, in which case, please pass on our regards to your dad!).
Even though I attempted to make the data more similar to the noisy real-life data by simulating some of the possible errors (typos, missing/extra spaces) in two styles, this might not be enough. We simply don’t know the typical distribution of the errors, or even what all the possible errors are, so our data was probably still far from the real, noisy reference strings.
The differences in the distributions are a second major issue with the previous experiment. To build the dataset, I used a random sample from Crossref metadata, so the distribution of the cited item types (journal paper, conference proceeding, book chapter, etc.) reflects the overall distribution in our collection. However, the distribution in real life might be different if, for example, journal papers are on average cited more often than conference proceedings.
Similarly, the distribution of the citation styles is most likely different. To generate the reference strings, I used 11 styles distributed uniformly, while the real distribution most likely contains more styles and is skewed.
All these issues can be summarized as: the data used in my previous experiment is different from the data our matching algorithms have to deal with in the production system. Why is this important? Because in such a case, the evaluation results do not reflect the real performance in our system, just like the child’s score on the math exam says nothing about their score on the history test. We can hope my previous results accurately showed the strengths and weaknesses of each algorithm, but the estimations could be far off.
So, can we do better? Sure!
This time, instead of automatically-generated reference strings, I will use real reference strings found in the Crossref metadata. This will give us a much better picture of the matching algorithms and their real-life performance.
This time the evaluation dataset is composed of 2,000 unstructured reference strings from the Crossref metadata, along with the target true DOIs. The dataset was prepared mostly manually:
The metrics this time are based on the citation links. A citation link points from the reference (or the document containing the reference) to the referenced (target) document.
When we apply a matching algorithm to a set of reference strings in our collection, we get a set of citation links between our documents. I will call those citation links returned links.
On the other hand, in our collection we have real, true links between the documents. In the best-case scenario, the set of true links and the set of returned links are identical. But we don’t live in a perfect world and our matching algorithms make mistakes.
To measure how close the returned links are to the true links, I used precision, recall and F1. This time they are calculated over all citation links in the dataset. More specifically:
In the previous experiment, I also used precision, recall and F1, but they were calculated for each target document and then averaged. This time precision, recall and F1 are not averaged but simply calculated over all citation links. This is a more natural approach, since now the dataset comprises isolated reference strings rather than target documents, and in practice each target document has at most one incoming reference.
I tested the same four approaches as before:
All the thresholds are parameters which have to be set prior to the matching. The thresholds used in the experiments were chosen using a separate dataset, as the values maximizing the F1 of each algorithm.
The plot shows the overall results of all tested approaches:
The exact values are also given in the table (the best result for each metric is bolded):
|SBM (simple threshold)||0.8686||0.8191||0.8431|
|SBM (normalized threshold)||0.7712||0.9121||0.8358|
As we can see, the legacy approach is the best in precision, slightly outperforming SBMV. In recall, SBMV is clearly the best, which also decided about its victory over the legacy approach in F1.
How do these results compare to the results from my previous blog post? The overall trends (the legacy approach slightly outperforms SBMV in precision, and SBMV outperforms the legacy approach in recall and F1) are the same. The most important differences are: 1) on the real dataset SBM without validation is worse than the legacy approach, and 2) this time the algorithms achieved much higher recall. These differences are most likely related to the difference in data distributions explained before.
Let’s look at a few example cases where SBMV successfully returned the correct DOI, while the legacy approach failed.
Lundqvist D, Flykt A, Ohman A: The Karolinska Directed Emotional Faces - KDEF, CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet. 1998
The target item is a dataset, which means unusual metadata fields and an unusual reference string.
Schminck, A. , ‘The Beginnings and Origins of the “Macedonian” Dynasty’ in J. Burke and R. Scott , eds., Byzantine Macedonia: Identity, Image and History (Melbourne, 2000), 61–8.
This is an example of a book chapter. The reference string contains special quotes and dash characters.
R. Schneider,On the Aleksandrov-Fenchel inequality, inDiscrete Geometry and Convexity (J. E. Goodman, E. Lutwak, J. Malkevitch and R. Pollack, eds.), Annals of the New York Academy of Sciences440 (1985), 132–141.
In this case, spaces are missing in the reference string, which might be problematic for the parsing.
R. B. Husar andE. M. Sparrow, Int. J. Heat Mass Transfer11, 1206 (1968).
This is another example of a reference string with missing spaces.
F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014 update. Supercomputing frontiers and innovations, 1(1), 2014.
In this case authors are missing in the Crossref metadata.
Li KZ, Shen XT, Li HJ, Zhang SY, Feng T, Zhang LL. Ablation of the Carbon/carbon Composite Nozzle-throats in a Small Solid Rocket Motor[J]. Carbon, 2011, 49: 1 208–1 215
Here we have unexpected spaces inside page numbers.
N. Kaloper, A. Lawrence and L. Sorbo, An Ignoble Approach to Large Field Inflation, JCAP 03 (2011) 023 [ arXiv:1101.0026 ] [ INSPIRE ].
In this case we have an acronym of the journal name and additional arXiv id.
KrönerE. ?Stress space and strain space continuum mechanics?, Phys. Stat. Sol. (b), 144 (1987) 39?44.
This reference string has a missing space, a missing word in the title, and incorrectly encoded special characters.
Suyemoto K. L., (1998) The functions of self-mutilationClinical Psychology Review 18(5): 531–554
In this case the space is missing between the title and the journal name.
Ono , N. 2011 Stable and fast update rules for independent vector analysis based on auxiliary function technique Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 189 192
The parsing can also have problems with missing punctuation, like in this case.
Hybertsen M.S., Witzigmann B., Alam M.A., Smith R.K. (2002) 1 113
In this case both title and journal name are missing from the reference string.
We can see from these examples that SBMV is fairly robust and able to deal with a small amount of noise in the metadata and reference strings.
What about the errors SBMV made? From the perspective of citation links, we have two types of errors:
When we apply SBMV instead of the legacy approach, the fraction of false positives within the returned links increases from 1.05% to 1.91%, and the fraction of false negatives within the true links decreases from 13.15% to 5.44%. This means with SBMV:
We can also classify all the references in the dataset into several categories, based on the values of true and returned DOIs:
We have the following categories:
Note that in terms of these categories, precision is equal to:
And recall is equal to:
What are the most common causes of SBMV’s errors?
The most important limitation is the size of the dataset. Every item had to be verified manually, which significantly limited the possibility of creating a large set and also using a lot of independent sets.
Finally, the numbers reported here still don’t reflect the overall precision and recall of the current links in the Crossref metadata. This is because:
The next experiment will be related to the structured references. Similarly as here, I will try to estimate the performance of the search-based matching approach and compare it to the performance of the legacy approach.
The evaluation framework, evaluation data and experiments related to the reference matching are available in the repository https://github.com/CrossRef/reference-matching-evaluation. Future experiments will be added there as well.
https://github.com/CrossRef/reference-matching-evaluation also contains the Python implementation of the SBMV algorithm. The Java implementation of SBMV is available in the repository https://github.com/CrossRef/search-based-reference-matcher.
2020 March 27
2020 March 24