4 Responses to “Automatic metadata generation”

  1. Nathan says:

    Eric,

    Interesting. You say: “Third, automatically generated keywords and phrases were many times just as useful as the librarian-assigned Library of Congress Subject headings.”

    I grow concerned, though, because the strings:

    Universalism United States History
    Unitarian Universalist churches United States

    …seem to me to be much more fulsome and useful. I’m not saying your automatically generated list isn’t useful, or that it isn’t complementary to LCSH; I’m just concerned with your phrase “just as useful”, which, pardon me, seems to give you away as someone who, when weighing the options, will be more than willing to let LCSH go when the time comes.

    And that, I think, is not going to serve us well.

    ~Nathan

    Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

    Universalism United States History
    Unitarian Universalist churches United States

  2. Karen Coyle says:

    Eric, have you looked at the work done at NLM on using generated metadata as part of the subject cataloging process? They found that presenting catalogers with a list of machine-generated subjects was helpful, especially for the less experienced catalogers. Info at: http://ii.nlm.nih.gov/.

  3. Tom Burton-West says:

    Hi Eric,

    You might want to take a look at the open source Kea software, which does something similar (it uses tf*idf and some other heuristics).

    http://www.nzdl.org/Kea/

    Tom
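
For readers unfamiliar with the tf*idf weighting mentioned in the comment above, here is a minimal sketch in Python. It is not Kea's implementation and not the method used in the original post; the function names (`tokenize`, `tf_idf`, `top_terms`) and the three toy documents are assumptions made up for illustration. It scores each term by how often it appears in a document, discounted by how many documents contain it.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(doc_scores, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, not data from the original post.
docs = [
    "universalism in america a history of universalism",
    "a history of cataloging in america",
    "cataloging rules for libraries",
]
scores = tf_idf(docs)
```

Calling `top_terms(scores[0], 1)` on this toy corpus picks out "universalism", because it is frequent in the first document and absent from the others. Real keyword extractors add further steps (stop words, phrases, stemming) that this sketch leaves out.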

  4. Thank you for the useful comments, and rest assured, I have not been ignoring you.

    Yes, an additional next step would be to map the automatically generated subject “tags” to Library of Congress Subject Headings or some other controlled vocabulary. This has been done a number of times previously with varying degrees of success, but compared to the other problems at hand, such as making the documents themselves more useful and interrelated, I plan to forgo this option at the present time. I understand the value of a controlled vocabulary, and that is why I have retained it whenever possible.
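
As a rough illustration of what mapping generated tags to a controlled vocabulary could look like, here is a minimal sketch in Python. It is only one naive approach (Jaccard overlap of word sets), not something described in the post or known to be used by any of the projects mentioned. The function name `best_heading`, the `threshold` parameter, and the sample tags are hypothetical; the two headings are the ones quoted earlier in this thread.

```python
def tokens(text):
    """Split a tag or heading into a set of lowercase words."""
    return set(text.lower().split())


def best_heading(tag, headings, threshold=0.2):
    """Return the heading whose words overlap most with the tag.

    Similarity is the Jaccard index: |intersection| / |union|.
    Returns None when no heading reaches the (assumed) threshold.
    """
    tag_tokens = tokens(tag)
    best, best_score = None, 0.0
    for heading in headings:
        heading_tokens = tokens(heading)
        union = tag_tokens | heading_tokens
        score = len(tag_tokens & heading_tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = heading, score
    return best if best_score >= threshold else None


# The two headings quoted in the comments above.
HEADINGS = [
    "Universalism United States History",
    "Unitarian Universalist churches United States",
]

# Hypothetical machine-generated tags, for illustration only.
match = best_heading("universalism history", HEADINGS)
no_match = best_heading("steam engines", HEADINGS)
```

Here `match` is "Universalism United States History" and `no_match` is `None`. Word overlap of this kind misses synonyms and word variants ("Universalist" versus "universalism"), which is one reason such mappings succeed only to varying degrees.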