Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.
I have been making significant progress with VIAF Finder [0], and now I have put the Levenshtein algorithm to work. In fact, I believe I am now able to programmatically choose VIAF identifiers for more than 50 or 60 percent of the authority records.
The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes (insertions, deletions, or substitutions of a single character) necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly, the distance between “Stefano B” and “Stefano B.” is still 1.
Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio to denote the relative distance of two strings. This ratio is the length of the longest string minus the Levenshtein distance, divided by the length of the longest string.
From the first example, the distance is 1 and the length of the longest string is 4 (both “eric” and “erik” are 4 characters long), thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the strings in the second example are a closer match than the strings in the first example.
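The ratio can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the code used by VIAF Finder or by the reconciliation service; the function names levenshtein and ratio are made up for this example. It counts characters rather than bytes, so other implementations may score names with accented letters slightly differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Return (longest length - distance) / longest length, from 0.0 to 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longest - levenshtein(a, b)) / longest


print(ratio("eric", "erik"))              # 0.75
print(ratio("Stefano B", "Stefano B."))   # 0.9
```

A ratio of 1.0 means the two strings are identical, and the value falls toward 0.0 as more of the longer string has to be retyped.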
Using the value of MARC 1xx$a of an authority file, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:
  query: Lucariello, Donato
  hits: 0
If I get too many hits, then I still do nothing:
  query: Lucas Lucas, Ramón
  hits: 18
  warning: search results out of bounds; consider increasing MAX
If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:
  query: Lucaites, John Louis
  hits: 1
  score: 0.250 John Lucaites (57801579)
  action: perfection achieved (updated name and id)
If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:
  query: Lucas, John Randolph
  hits: 3
  score: 1.000 Lucas, John Randolph (248129560)
  score: 0.650 Lucas, John R. 1929- (98019197)
  score: 0.500 Lucas, J. R. 1929- (2610145857009722920913)
  action: perfection achieved (updated name and id)
If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):
  query: Lucifer Calaritanus
  hits: 5
  score: 1.000 Lucifer Calaritanus (189238587)
  score: 1.000 Lucifer Calaritanus (187743694)
  score: 0.633 Luciferus Calaritanus -ca. 370 (1570145857019022921123)
  score: 0.514 Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
  score: 0.417 Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
  action: perfection achieved (updated name and id)
If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):
  query: Palanque, Jean-Remy
  hits: 5
  score: 0.950 Palanque, Jean-Rémy (106963448)
  score: 0.692 Palanque, Jean-Rémy, 1898- (46765569)
  score: 0.667 Palanque, Jean Rémy, 1898- (165029580)
  score: 0.514 Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
  score: 0.190 Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
  action: perfection achieved (updated name and id)
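Taken together, the rules above amount to a small decision procedure. The following Python sketch is one way to express it, assuming the hits have already been scored against the query. The names Hit and choose are hypothetical, and max_hits stands in for the MAX setting mentioned in the warning, whose actual value is not given here. Only the 0.949 threshold comes from the configuration described above.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    score: float   # Levenshtein ratio against the query, 0.0 to 1.0
    name: str      # heading returned by VIAF
    viaf_id: str   # VIAF identifier


def choose(hits: Sequence[Hit], max_hits: int,
           threshold: float = 0.949) -> Optional[Hit]:
    """Return the hit to save, or None when nothing should be done."""
    if not hits:
        return None                     # 0 hits: do nothing
    if len(hits) > max_hits:
        return None                     # too many hits: do nothing
    if len(hits) == 1:
        return hits[0]                  # a single hit is accepted as-is
    # max() returns the first of any tied top scores, so when several
    # hits match exactly, the first one listed is used.
    best = max(hits, key=lambda hit: hit.score)
    if best.score > threshold:
        return best                     # exact match, or close enough
    return None


hits = [
    Hit(1.000, "Lucas, John Randolph", "248129560"),
    Hit(0.650, "Lucas, John R. 1929-", "98019197"),
    Hit(0.500, "Lucas, J. R. 1929-", "2610145857009722920913"),
]
chosen = choose(hits, max_hits=10)      # max_hits=10 is an arbitrary value
print(chosen.viaf_id)                   # 248129560
```

Because an exact match scores 1.000, which is above the threshold, the "exact match" and "good enough" cases collapse into a single comparison.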
By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein algorithm/logic identifies more than 40 or 50 percent. AutoSuggest is much faster though. Much.
Fun with the intelligent use of computers, and think of the possibilities.
[0] VIAF Finder – ./2016/05/viaf-finder/
[1] Levenshtein – http://bit.ly/1Wz3qZC
[2] reconciliation service – https://github.com/codeforkjeff/refine_viaf
Thank you for your kind words about my refine_viaf project!
Very interesting to read about your success rate compared with VIAF’s AutoSuggest. I wonder if AutoSuggest is optimized for fuzzy matches, making it work less well with names that are already fairly close matches. I know I’ve experienced odd results with ElasticSearch and Apache Solr if you don’t specifically tune things for the data you’re working with. If VIAF uses one of those search products, that might explain it.
I’m so pleased to have discovered your incredibly thoughtful blog. You have a new regular reader.