Today's digital information landscape

The main point of this lecture is to bring home a single idea, namely, that the what of library and information science has not changed so much as the how. Libraries are still about the processes of collection, preservation, organization, dissemination, and sometimes evaluation of data and information. While the media, environments, and tools have dramatically changed, the problems and services the profession addresses remain the same. If we focus on our broader goals -- see the forest for the trees -- then the profession's future is bright, offering us many opportunities. If we focus too much on the particulars, then libraries and librarians will be seen as increasingly irrelevant. The following examples will hopefully make this clear.

This presentation is also available as an abridged, one-page handout and a set of PowerPoint slides.

MARC and XML

MARC was a great technology that has all but outlived its usefulness. It was created in a time of computing scarcity and sequential access data files. Moreover, it has gone from a data structure used to print catalog cards to the fodder for entire library systems. MARC is a Gordian Knot that needs to be cut, and XML put in its place.

For example, MARC is arbitrarily limiting. The first five characters of any MARC record are a left-padded, zero-filled string denoting the length of the record. Think about it. Given this characteristic, how long can the longest MARC record be? The answer is 99,999 characters. That might have been fine when you wanted to print no more than three or four catalog cards, but that limitation is behind us.

Each MARC record is divided into three sections: 1) the leader, 2) the directory, and 3) the bibliographic section. The directory is made up of many 12-character entries, each denoting a three-digit MARC tag (such as 245), a four-digit string denoting the length of the field, and a five-digit offset pointing to the start of the field's data. Again, given four digits, how long can any MARC field be? Answer: 9,999 characters. What if I want to include an extended abstract or summary in the record that is 10,000 characters long? No go.
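The arithmetic above can be made concrete with a short sketch. The following Python parses the leader's record length and the directory's 12-character entries; the sample record is fabricated and simplified for illustration, not a complete, valid MARC record.

```python
def parse_marc_structure(record: str):
    """Return (record length, [(tag, field length, offset), ...])."""
    # Characters 0-4 of the leader are the zero-padded record length,
    # so no record can ever exceed 99,999 characters.
    record_length = int(record[0:5])
    # The directory begins at position 24 and runs to a field terminator.
    directory = record[24:record.index("\x1e", 24)]
    entries = []
    # Each 12-character entry: 3-digit tag, 4-digit field length
    # (capping any field at 9,999 characters), 5-digit offset.
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return record_length, entries

# A fabricated record: a 24-character leader, one directory entry for a
# 245 field, the field itself, and a record terminator.
record = ("00060nam a2200000 a 4500"
          "245002200000" "\x1e"
          "10\x1fGordian Knot, The." "\x1e"
          "\x1d")
print(parse_marc_structure(record))  # (60, [('245', 22, 0)])
```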

There are other reasons why MARC is not the greatest technology for today's needs, but the most important has nothing to do with computing efficiency, language bias, or mathematical elegance. The most important reason is social. MARC is not a data structure familiar to institutions beyond libraries. The rest of the world uses XML. If libraries are about disseminating data and information, then libraries need to speak the language of the intended audience. The audience understands XML but not MARC.
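By contrast, XML has no fixed-width limits -- a field can be as long as it needs to be -- and its tools are ubiquitous. As a small illustration, here is a sketch of a 245 field expressed in the style of MARCXML, using only Python's standard library; the namespace declarations of real MARCXML are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Build a MARCXML-style datafield; unlike a MARC directory entry,
# nothing here caps the length of the subfield's text.
datafield = ET.Element("datafield", tag="245", ind1="1", ind2="0")
subfield = ET.SubElement(datafield, "subfield", code="a")
subfield.text = "Today's digital information landscape"
print(ET.tostring(datafield, encoding="unicode"))
# <datafield tag="245" ind1="1" ind2="0">...</datafield>
```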

Databases and indexes

Databases are great for organizing and maintaining data but a poor technology for search. Indexes are great at search but really weak on maintenance. Use databases and indexes in conjunction with each other to create information systems. They are two sides of the same information retrieval coin.

The way to organize and maintain data in a digital environment is through the use of a database. Formulate rows and columns of data into tables of records. Use relational database techniques (normalization) to "join" common elements (keys) between tables, and thus improve data integrity and reduce redundancy. Through such a technique huge find/replace operations are almost unnecessary and intricate reports can be crafted through sufficient linking operations.
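As a toy illustration of these ideas, the following uses Python's built-in SQLite bindings; the schema and data are invented here. Publisher names are stored once and joined by key, so a name change is a single UPDATE rather than a huge find/replace across every record.

```python
import sqlite3

# Normalization in miniature: publisher names live in their own table
# and books reference them by key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT,
                        publisher_id INTEGER REFERENCES publishers(id));
    INSERT INTO publishers VALUES (1, 'Acme Press');
    INSERT INTO books VALUES (1, 'Cataloging Today', 1),
                             (2, 'Indexing Tomorrow', 1);
""")
# One UPDATE fixes the name everywhere it is used.
db.execute("UPDATE publishers SET name = 'Acme Publishing' WHERE id = 1")
rows = db.execute("""
    SELECT books.title, publishers.name
    FROM books JOIN publishers ON books.publisher_id = publishers.id
""").fetchall()
print(rows)
```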

Unfortunately, to search databases the end user must know the structure of the database beforehand. They need to know what tables and fields are available, what kind of data they contain, and what relationships they have with other fields. This is very tricky to do without access to an entity-relationship diagram and the ability to read it.

Indexes, on the other hand, excel at search. Enter a word or phrase, and a list of hits is returned. It is not necessary to specify fields. The user can get away with very little knowledge of syntax. Results can be returned in many different orders, most importantly relevancy ranked. Compared to databases, indexes are very easy to search. At the same time, you cannot take advantage of normalization techniques using indexes; they are very poor at maintaining data.

A common technique for creating information systems is to combine databases and indexes. Use a database to organize and maintain the data. On a regular basis run a report from the database containing the content to be searched. Feed the report to an indexer to create an index. Provide an interface to search the index. Organizing and disseminating data and information are a lot of what libraries are about. In a digital environment, these are the tools that facilitate these tasks. We have organization and dissemination skills; we just do not exploit the best tools to accomplish these ends in the current environment.
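The database-report-index technique can itself be sketched in a few lines. The records below stand in for a report run from the database, and the index is a bare-bones inverted index; a production system would feed the report to a real indexer instead.

```python
# Pretend this dictionary is a report run from the database:
# record id -> the text to be made searchable. Titles are made up.
records = {
    1: "cataloging in the digital age",
    2: "digital preservation and access",
}

# Build the inverted index: word -> set of record ids containing it.
index = {}
for record_id, text in records.items():
    for word in text.split():
        index.setdefault(word, set()).add(record_id)

def search(query):
    """Return ids of records containing every word of the query."""
    result = None
    for word in query.lower().split():
        hits = index.get(word, set())
        result = hits if result is None else result & hits
    return result or set()

print(search("digital"))         # hits in both records
print(search("digital access"))  # hits in record 2 only
```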

Exploiting the network and building relationships

Facebook, MySpace, and LinkedIn are all the rage, and the reason they are so popular is the networking. It is really easy to identify other people with your particular interests. Through tagging, reviewing, and sharing, the people who spend time with social networking are building relationships between themselves and other people through common interfaces.

We, the library profession, can do this too, as long as we try to put ourselves in the patrons' spaces. It is not as much about people physically visiting a library or even coming to a library website. It is more about making our content -- and expertise -- available at the time and place users need it most. The jury is still out on whether Second Life libraries, MySpace pages, or Facebook accounts will be long-term library success stories. My opinion is they will not, but only time will tell. Instead, through shared bookmarking services, syndication & integration of library content with other content, and personalized services, patrons will create their own "libraries" meeting their individual needs. Libraries will be a part of these personal collections if and only if our content is easy to "mash up".

The global network of computers supporting social networking and mash-ups is also one of the fundamental ingredients in the success of open source software projects. The heart of open source is a process of building networks of like-minded individuals who desire to solve common computing problems. The process is not without leadership, norms of behavior, and conflict. At the same time, it is neither a highly structured nor a centralized process. It relies on a loosely joined network of computers as well as a loosely joined network of people.

Open source is not your father's "homegrown" integrated library system primarily because of the network, ubiquitous computing, and abundance of relatively sophisticated computing building blocks like standard operating systems, relational databases, and indexes. Librarians have a long tradition of collaboration, a necessary component in open source software development. In open source software everybody has something to offer. Users. Writers. Testers. Usability experts. Marketers. Systems administrators. Etc. Open source software can be a path to greater control over our computing environments. All we have to do is tweak the way we view the role of computers in our workplaces.

The process of computing has also taken advantage of the network. Web Services computing, in the form of SOAP and RESTful applications, is the best example. AJAX is a nifty computing technique. OAI-PMH is a wonderful way of sharing metadata. Blogs, ATOM, and RSS lower the barrier for the sharing of ideas and publication. OpenSearch and SRU exploit the network to provide modern-day Z39.50-like search interfaces. Ask yourself how useful your computer is without its Internet connection. The folks at Sun Microsystems surely got it right a number of years ago with their slogan, "The network is the computer."
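OAI-PMH in particular is little more than HTTP plus XML: a harvesting request is simply a URL. The sketch below builds one with Python's standard library; the repository address is hypothetical, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Build an OAI-PMH ListRecords request asking for Dublin Core metadata.
# The base URL is a made-up example, not a real repository.
base_url = "http://example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
request_url = base_url + "?" + urlencode(params)
print(request_url)
# http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

Fetching that URL returns a plain XML document listing metadata records, which is exactly what makes the protocol so easy to exploit with ordinary Web tools.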

A rising tide floats all boats. The tide of network computing is certainly upon us. Let's make sure our boats are in the water.

Institutional repositories and open access

Institutional repositories seem to be yet another reaction to the dramatic and never-ending price increases in scholarly literature. Believe it or not, OAI-PMH was one of the first reactions. SPARC was another. Sprinkle the idea of open source on institutional repositories and you get open access. Institutional repositories and open access publishing activities are here to stay, but so is commercial publishing. Just as open source software is not going to replace commercial software, institutional repositories and open access publishing will live side by side with their commercial counterparts.

I attended the Charleston Conference a few weeks ago. The majority of participants come from academic library acquisitions departments and academic publishing companies. Next to the topic of ebooks, open access was on everybody's mind. A common question was, "What are we going to do for a job when and if everything becomes open access?" Again, this sort of question focuses too much on the how of the profession and less on the what. Open access is (or can be) an acquisitions librarian's dream come true, as long as you think of acquisitions as the process of bringing materials into a collection. Identify content. Bring it in locally. Organize and index it. Make it available and useful to the local constituents. Moreover, once the content is in hand and in digital form, there is a myriad of other value-added services libraries can provide against this content, as outlined in the section below.

Acquisitions departments are not necessarily about buying content. If they were, then they would be working in the Purchasing Department. An acquisitions department is responsible for bringing collections into the library. Those collections can include items from commercial publishers, open access sites, the hosting college/university, or the Web in general. How are you going to preserve the content if you don't bring it in locally?

"Next Generation" Library Catalogs

The traditional professional vision of the library catalog does not meet the needs of our current constituents. They appreciate the authority of its content. They trust the judgment of the librarian. But the tools the profession uses to provide access to its materials seem antiquated. Something needs to change. Access to library collections and services against them are the most visible network characteristics of libraries. What's next?

A "next generation" library catalog needs to include content beyond the content a library owns. In a networked environment, ownership is not as important. The catalog needs to be defined as the content needed by the students, instructors, and scholars necessary to do their learning, teaching, and research. While it will never be possible to acquire all the necessary content, it will include bibliographic data from books & journals, theses & dissertations, government documents, images, movies, sounds, data sets, etc. By combining metadata with full text it will be better able to perform relevancy ranking and identify obscure facts. Moreover, the full text will enable the system to provide the enhanced services outlined below. In the meantime federated/broadcast search will fill a gap but its promise will never be fulfilled for all the reasons we already know. Speed. Network latency. Dumbing down metadata. Screen scraping. Long term maintenance.

The most significant differences between traditional library catalogs and the "next generation" library catalog lie in: 1) the enhancement of the discovery process and 2) the provision of services against the collection beyond simple identification. Putting the users' needs and characteristics at the center of the query process will greatly enhance discovery. By knowing more about the searcher -- placing the query in context with the searcher -- it will be possible to improve find significantly. For example, if you know the searcher is a freshman, then it is safe to assume their experience or knowledge is less than a senior's, and therefore a different set of resources may be appropriate for their needs.

Search can take experience into account and present results accordingly. Suppose the searcher is an expert in anthropology but is looking for information on micro-economics. Given this, it is unlikely the searcher will want advanced micro-economic data, at least not right away. Present the results accordingly. Now assume the searcher has a history of doing many micro-economic searches. Either they are not finding what they desire or they are looking for more specific information. Return search results accordingly.

Put another way, ask yourself questions about the searcher and tailor the search results. Who are they? What is their level of skill or education? What classes are they taking? What is their major? How old are they? Are they new to the subject or an expert? Who are their peers, and what are they using? Use those resources as a guide. Do they want help? To what degree do they desire privacy? By knowing the answers to these sorts of questions, search results can be tailored to meet individual needs; search can be put into the user's context.
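One way to sketch this kind of context-sensitive search is to boost each hit's relevancy score according to a user profile. The profile fields, weights, and sample hits below are all invented for illustration; a real system would draw on far richer data.

```python
def rank(hits, profile):
    """Re-order hits by audience match on top of base relevance."""
    def score(hit):
        boost = 0.0
        # Prefer introductory material for first-year students.
        if profile["year"] == "freshman" and hit["level"] == "introductory":
            boost += 1.0
        # Prefer material in the searcher's own major.
        if hit["subject"] == profile["major"]:
            boost += 0.5
        return hit["relevance"] + boost
    return sorted(hits, key=score, reverse=True)

hits = [
    {"title": "Advanced Micro-economics", "level": "advanced",
     "subject": "economics", "relevance": 0.9},
    {"title": "Economics 101", "level": "introductory",
     "subject": "economics", "relevance": 0.7},
]
profile = {"year": "freshman", "major": "anthropology"}
print(rank(hits, profile)[0]["title"])  # Economics 101
```

The point is not the particular weights but the shape of the computation: the same query returns different orderings for different people.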

More importantly, a "next generation" library catalog will provide services against the things discovered. These services can be enumerated and described with action statements including but not limited to: get it, add it to my personal collection, tag & classify it, review it, buy it, delete it, edit it, share it, link it, compare & contrast it, search it, summarize it, extract all the images from it, cite it, trace it, delete it. Each of these tasks supplement the learning, teaching, and research process. They are tools and processes our students, instructors, and researchers use to accomplish their individual goals. All of these processes are things libraries already support in our physical environment to some degree, but with the advent of the globally networked environment libraries need to figure out how to provide these services on people's computer desktops.

I like to summarize the ideas about the catalog in this way. "Collections without services are useless. Services without collections are empty. Library catalogs lie at the intersection of collections and services."

The computing model for such a system is not too difficult to illustrate. Aggregate content into one or more central stores. Index the content. Provide access to the index. Provide services against the found items. The keys to implementing this model lie in putting into practice three engineering principles and a few community/leadership activities.

The engineering principles begin with making small applications that do one thing and do it well; do not go about creating one huge system. Second, make sure your small parts work well with each other. This usually means supporting the concepts of standard input, standard output, and standard error. Finally, use plain text as much as possible, not binary data, since plain text is a universal interface.
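These are the classic Unix design principles, and they are easy to honor in practice. The sketch below is a small tool of this kind: it reads plain text on standard input, writes normalized text on standard output, and reports problems on standard error, so it can be chained with other tools in a pipeline.

```python
import sys

def lowercase_lines(lines):
    """Normalize an iterable of text lines to lowercase."""
    return [line.lower() for line in lines]

if __name__ == "__main__":
    try:
        # Plain text in, plain text out: the universal interface.
        sys.stdout.writelines(lowercase_lines(sys.stdin))
    except Exception as error:
        # Report problems on standard error, not standard output.
        print("error: %s" % error, file=sys.stderr)
        sys.exit(1)
```

Because it speaks plain text on the standard streams, it composes with any other tool that does the same, which is exactly the "small parts, loosely joined" idea.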

When it comes to community/leadership principles, employ community-driven standards as much as possible. This will enable modularity. Support experimentation, innovation, and play. This will enable your institution to be nimble, flexible, and more responsive to user needs. These practices allow institutions to be leaders as opposed to followers. They empower you to be more proactive. Understand that everybody has something to offer. A "next generation" library catalog is not so much a computer problem as it is a library/university problem. By getting as many people involved as possible, it will be easier to create a system that meets most people's needs.

Summary

The principles of collection, organization, preservation, and dissemination are extraordinarily relevant in today's digital landscape. The advent of globally networked computers, Internet indexes, and mass digitization projects has not changed this fact. If anything, they highlight the need for these processes even more. Libraries are just one of many players in the information universe. It is increasingly important to adapt to the changing landscape and at the same time bring new value to the collections and services we provide. It is not so much about what we are doing. It is more about how we are doing it.


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This essay was originally written for a lecture at the University of North Texas (December 4, 2007)
Date created: 2007-12-01
Date updated: 2007-12-01
Subject(s): presentations; Denton, TX; librarianship;
URL: http://infomotions.com/musings/digital-landscape/