Achieving perfection

Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.

trastevareI am making significant progress with VIAF Finder [0], but now I have exploited the use of the Levenshtein algorithm. In fact, I believe I am now able to programmatically choose VIAF identifiers for more than 50 or 60 percent of the authority records.

The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly the distance between “Stefano B” and “Stefano B.” is still 1. Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio to denote the relative distance of two strings. This ratio is the quotient of the longest string minus the Levenshtein distance divided by the length of the longest string. From the first example, the distance is 1 and the length of the string “eric” is 4, thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the second example is more correct than the first example.

Using the value of MARC 1xx$a of an authority file, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:

       query: Lucariello, Donato
        hits: 0

If I get too many hits, then I still do nothing:

       query: Lucas Lucas, Ramón
        hits: 18
     warning: search results out of bounds; consider increasing MAX

If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:

       query: Lucaites, John Louis
        hits: 1
       score: 0.250     John Lucaites (57801579)
      action: perfection achieved (updated name and id)

If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:

       query: Lucas, John Randolph
        hits: 3
       score: 1.000     Lucas, John Randolph (248129560)
       score: 0.650     Lucas, John R. 1929- (98019197)
       score: 0.500     Lucas, J. R. 1929- (2610145857009722920913)
      action: perfection achieved (updated name and id)

If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):

       query: Lucifer Calaritanus
        hits: 5
       score: 1.000     Lucifer Calaritanus (189238587)
       score: 1.000     Lucifer Calaritanus (187743694)
       score: 0.633     Luciferus Calaritanus -ca. 370 (1570145857019022921123)
       score: 0.514     Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
       score: 0.417     Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
      action: perfection achieved (updated name and id)

If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):

       query: Palanque, Jean-Remy
        hits: 5
       score: 0.950     Palanque, Jean-Rémy (106963448)
       score: 0.692     Palanque, Jean-Rémy, 1898- (46765569)
       score: 0.667     Palanque, Jean Rémy, 1898- (165029580)
       score: 0.514     Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
       score: 0.190     Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
      action: perfection achieved (updated name and id)

By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein algorithm/logic identifies more than 40 or 50 percent. AutoSuggest is much faster though. Much.

Fun with the intelligent use of computers, and think of the possibilities.

[0] VIAF Finder – http://infomotions.com/blog/2016/05/viaf-finder/

[1] Levenshtein – http://bit.ly/1Wz3qZC

[2] reconciliation service – https://github.com/codeforkjeff/refine_viaf

Published by

Eric Lease Morgan

Artist- and Librarian-At-Large

One thought on “Achieving perfection”

  1. Thank you for your kind words about my refine_viaf project!

    Very interesting to read about your success rate compared with VIAF’s AutoSuggest. I wonder if AutoSuggest is optimized for fuzzy matches, making it work less well with names that are already fairly close matches. I know I’ve experienced odd results with ElasticSearch and Apache Solr if you don’t specifically tune things for the data you’re working with. If VIAF uses one of those search products, that might explain it.

    I’m so pleased to have discovered your incredibly thoughtful blog. You have a new regular reader.

Comments are closed.