Open source software in libraries
This is an essay about open source software and libraries. It outlines what open source software is and is not. It discusses the relationship of open source software to the integrated library system. It compares open source software to open access journals and the evolutionary shift academe is experiencing in the world of scholarly communication. Finally, it briefly reviews selected pieces of open source software and describes how they can be used in libraries.
Open source software as a philosophy and a process
Open Source Software (OSS) is both a philosophy and a process. As a philosophy it describes the intended use of software and methods for its distribution. Depending on your perspective, the concept of OSS is a relatively new idea, being only four or five years old. On the other hand, the GNU Software Project -- a project advocating the distribution of "free" software -- has been operational since the mid '80s. Consequently, the ideas behind OSS may have been around longer than you think.
The story begins with a man named Richard Stallman, who worked at MIT in an environment where software was freely shared. In the mid 1980s, Stallman resigned from MIT to begin developing GNU -- a software project intended to create an operating system much like Unix. (GNU is pronounced "guh-NEW" and is a recursive acronym for GNU's Not Unix.) His desire was to create "free" software, but the term "free" should be equated with freedom, and as such people who use "free" software should be: 1) free to run the software for any purpose, 2) free to modify the software to suit their needs, 3) free to redistribute the software gratis or for a fee, and 4) free to distribute modified versions of the software.
In other words, in the context of GNU software, the term "free" should be equated with the Latin word "liberare," meaning to liberate, and not necessarily "gratis," meaning without return made or expected. In the words of Stallman, we should "think of 'free' as in 'free speech,' not as in 'free beer.'"
At the height of the dot-com boom the phrase "open source software" was coined. As the story goes, a number of people from Red Hat, a company that sells Linux distributions as well as support, were sitting around one day trying to figure out how to market their products and services. The idea of "free" software was, and still is, a difficult idea for many people to understand. Consequently they were trying to come up with a new phrase that conveyed the idea of free software without using the word free. Since the source code is what is given away, and since people are encouraged to read and modify the source code, the phrase "open source software" was selected. It stuck.
The process of creating and maintaining open source software revolves around a communication process akin to the scholarly communications process of academia. The process begins with a programmer's "itch", a problem the programmer wants to solve. To continue with the metaphor, the programmer then "scratches their itch" by writing a computer program. They are proud of their creation, and they share it with sets of their peers. Through this initial communication process, others become aware of a possible solution to their own problems. Soon a community develops as many people begin to use the software. Someone whose problems are similar to, but different from, the initial problem decides to enhance the original program. This enhancement is given back to the original programmer. If the enhancement is not detrimental to the original concept of the program, then the enhancement is often incorporated into the program, the new piece of software is redistributed, and the process begins anew.
The process is similar to the scholarly communications process because open source software goes through a sort of peer review process. Fellow programmers examine a program's source code, find flaws, and suggest improvements. Eric Raymond, author of a book called The Cathedral and the Bazaar, describes this process in detail and posits that "given enough eyeballs, all bugs are shallow."
Costs of open source software
People often advocate open source software because it is free. While you will not pay for the source code directly, open source software is only "as free as a free kitten".
Suppose you were offered a free kitten. It is soft. It purrs. It plays with a ball of string. Cute and adorable, you take it home. First you buy a collar. Then you buy food and a food bowl. Next you take it to the veterinarian and they charge you a fee for shots. Alas, the kitten starts to cost you money. Moreover, the kitten escapes outdoors. It is lost overnight and you worry yourself sick. Not only have you invested time and money into your "free" kitten, but you have also invested emotional energy. Free kittens do not come without costs.
The same is true of commercial software as well as open source software. Both types of software cost time and money to install, configure, and implement. Training costs may be involved in learning how to use either type of software. Technical support may be included in the up-front costs of the commercial software. Similarly, it is quite possible to purchase support from open source software vendors or third parties. Once the software is up and running, an institution will spend emotional energy and become attached to particular features of their implementations. These are all real costs.
The difference in costs between commercial software and open source software is twofold. First, open source software does not include the up-front costs of commercial software. With open source software you get to "try before you buy", and you get to do this with a full-blown version of the product. No time-limited trials. No lack of documentation. No crippled features. You have the opportunity to see exactly what you are getting.
The second difference involves support costs. Commercial software will offer support, maybe for an extra fee. Maybe not. Most open source software does not come with formal support; there is rarely someone you can call on the telephone. Sometimes you can buy support from the vendor or third parties. If you want support you are expected to ask for help through online forums such as mailing lists or discussion boards. Since the original developers have personal stakes in the success of their application, it is quite likely they will be participating in the discussions and providing advice. If the particular piece of open source software is popular, then it is likely others will provide support. You, as an individual or institution, are expected to implement your own changes, customizations, or enhancements.
The time spent implementing changes, customizations, and enhancements is a real cost with both commercial software and open source software, but I contend that such costs are more akin to investments in personnel when it comes to open source software. For the most part, open source software is very standards compliant. There are few proprietary "enhancements" to standard file formats and protocols. Consequently, implementing changes in open source software is time spent learning skills that are transferable from computer program to computer program. The software skills applied against commercial software are more likely to be application-specific. Since open source software is more likely to be standards compliant, it is usually an easy task to export your data from one system and import it into another without having to remove or overcome the proprietary aspects of the data. The personnel costs associated with open source software are really investments in the institution. Institutions that make this investment will be effective at managing the change and risk of their computing environments.
Open source software and the integrated library system
Libraries are largely about the collection, storage, organization, dissemination, and sometimes evaluation of information and knowledge. With the advent of computer technology in libraries many of these processes have been implemented through a library's "integrated library system" or ILS. The primary purpose of the ILS seems to be the management of lists of MARC records and the facilitation of services against these lists. The online public access catalog (OPAC) provides searching functions against the list. Cataloging provides functions to add and edit items on the list. Acquisitions provides some accounting functions. Reserve room modules, circulation modules, and interlibrary loan modules allow items to be temporarily moved from one location to another. Serials modules provide functions for inventory control.
Problems appear when the ILS does not keep up with changing expectations or does not function the way a library desires. Suppose a librarian wants to create a statistical report against the library's holdings. They want to know how much money was spent on books classified as science materials. If the ILS does not support this sort of function then the only recourse is to ask the vendor for an upgrade. If your ILS is implemented as a relational database, then a competent database administrator should be able to read the database's entity relationship diagram and extract the necessary information. But this is only possible if the vendor supplies you with information about their database. Alternatively, suppose you wanted to provide a "virtual new bookshelf" allowing people to browse new acquisitions on a regular basis. If this is not an explicit function of your ILS, then you must find a workaround, and it is likely to be specific to your particular ILS. Patrons' expectations are changing too. The Google Did You Mean? service is very popular. The Amazon.com People Like You Also Read service seems to be popular too. What can we do to incorporate these things into our systems if they are not initially a part of their makeup?
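The kind of statistical report described above is straightforward wherever the underlying database can be queried directly. The following Python sketch uses SQLite as a stand-in and an entirely hypothetical, simplified schema -- the table and column names are mine for illustration, not those of any actual ILS -- to answer the "money spent on science materials" question.

```python
import sqlite3

# Hypothetical, simplified ILS holdings table; real vendor schemas differ
# widely and are often undocumented, which is exactly the problem.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, classification TEXT, price REAL);
INSERT INTO items VALUES
  (1, 'Introductory Physics', 'Q',  45.00),
  (2, 'Organic Chemistry',    'QD', 89.50),
  (3, 'Moby Dick',            'PS', 12.00);
""")

# How much money was spent on books classified as science materials?
# (In the LC classification, science call numbers begin with "Q".)
(total,) = db.execute(
    "SELECT SUM(price) FROM items WHERE classification LIKE 'Q%'"
).fetchone()
print(total)  # 134.5
```

The query itself is trivial; the point is that such reports require access to, and documentation of, the vendor's database.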
The problems are compounded when we realize that much of the information our patrons desire is not accessible in our ILS at all but through full-text and citation indexes. The catalog has traditionally been defined as an inventory list of the things a library physically owns or, nowadays, licenses. These licensed materials are often full-text or citation indexes to journal literature. People want to find an article on a particular topic, say, global warming. As librarians we must train patrons not to look in the catalog for such things but in a journal article index. Patrons find this increasingly difficult to understand since their experiences are driven by Google. One box. One button. Lots of stuff. "Library, why can't you do that?" Federated search engines are a hot topic these days. People are expecting simple search interfaces and one-stop shopping. Since much of this content is not owned by libraries but only licensed, libraries have a difficult time creating truly seamless access to the variety of licensed content as well as the content from our catalogs.
Short-term solutions to these problems are really "hacks" compounding the problem. These solutions include the use of traditional Z39.50 connections between computers. This solution is not great because Z39.50 is not universally nor consistently implemented. Other solutions include "screen scraping" techniques where HTML pages are retrieved by a program and the necessary information is extracted. This technique breaks as soon as the remote service changes its interface.
If libraries, as institutions, are willing to take more responsibility for their computing environments, then open source software techniques can play a role in filling these gaps. For example, there exist a number of open source tools allowing people to create and edit MARC records. The Perl module named MARC::Record is the most popular. Given this tool and an indexer/search engine (I currently endorse swish-e), the process of creating a virtual new bookshelf is almost trivial:
- Insert dates into bibliographic records denoting items' availability to the public.
- On a regular basis, extract from your ILS all the MARC records whose dates are greater than or equal to a particular cutoff date. This is your set of "new" items.
- Use something like MARC::Record to extract from the records the information people want to know (title, author, subject, notes, etc.) and save the extracted information as sets of HTML files.
- Provide a browsable interface to the sets of HTML files.
- Index the HTML files.
- Provide an interface to search the index.
- Return to the first step and repeat.
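The steps above can be sketched in a few lines of code. This Python sketch is illustrative only: it uses plain dictionaries in place of parsed MARC records and a crude in-memory word index in place of swish-e; in practice you would use something like MARC::Record and a real indexer.

```python
from datetime import date

# Toy stand-ins for parsed MARC records exported from the ILS; a real
# implementation would parse actual MARC with a library such as MARC::Record.
records = [
    {"title": "New Book on Linux", "author": "Doe, J.", "added": date(2004, 5, 1)},
    {"title": "Old Book on Unix",  "author": "Roe, R.", "added": date(1999, 1, 15)},
]

cutoff = date(2004, 1, 1)

# Select the "new" items: everything added on or after the cutoff date.
new_items = [r for r in records if r["added"] >= cutoff]

# Save the extracted information as simple HTML pages (here, in a dictionary).
pages = {
    r["title"]: f"<html><body><h1>{r['title']}</h1><p>{r['author']}</p></body></html>"
    for r in new_items
}

# Index the pages: a crude word-to-title index standing in for swish-e.
index = {}
for r in new_items:
    for word in r["title"].lower().split():
        index.setdefault(word, []).append(r["title"])

print(index["linux"])  # ['New Book on Linux']
```

Searching the index is then a dictionary lookup; the whole pipeline is rerun on a schedule, which is the "repeat" step.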
What's really nice about this algorithm is its vendor independence. As long as you can extract sets of MARC records from your ILS, then you can provide this sort of functionality. Even if MARC::Record or swish-e go away, there will be other tools available providing similar functionality.
The problem with this solution is the fact that most libraries, as institutions, do not have the necessary computing expertise to make the solution a reality. There does not seem to be a critical mass of people working in libraries who know how to write computer programs. Consequently library processes and computing environments are often held hostage by library-specific software vendors.
Open source software and open access journals
I assert that the same cultural factors and economic pressures that make open source software a viable option also make open access journal literature more appealing.
As you know, the prices of scholarly journals have been increasing at rates much higher than inflation. If my salary had increased at the same rate as scholarly journal prices, I would have doubled my salary more than five years ago and doubled it again since then. The problem is compounded by at least two other things. First, there is a shrinking number of publishers, since the smaller publishers are getting bought by the bigger ones. Second, it is considered in very bad taste to publish the same article in more than one journal. The shrinking number of publishers combined with the veritable monopoly writers grant publishers makes for higher prices.
This environment was considered okay as long as the prices did not get out of control, but the prices are now out of control. Because of these high prices fewer libraries (and therefore people) have access to this literature. Many of the libraries that still subscribe to these journals are doing so through electronic-only means. No print issues are delivered, and libraries are really licensing permission to access the information, not download, archive, or keep it. When subscriptions lapse, access is gone.
For-profit publishers who license their content and do not make it available for wholesale downloading and archiving are similar to commercial software vendors who do not open up their source code. When scholarly materials are widely distributed and archived, the historical record is preserved. In the model we see becoming more prevalent, scholarly materials are housed at the central location of a publisher in an unknown format. How is a person to know that this information will not be changed or inadvertently deleted? If access is restricted, how are we to "fix bugs" in the literature? What happens to the content if, for some reason, the publishers go out of business?
Open access journal literature might have other problems, but it doesn't have these. Open access journal literature is freely disseminated. It is mirrored and archived around the world. The authenticity of any open access journal article can easily be verified by comparing it to versions from other archives. People require unfettered access to information (read software) in order to build on the good work of others. While nobody wants to deny the ability of people to make a living, this living should not come at the cost of making it more difficult to improve the human condition. For the most part, selling physical things like paper journals or automobiles is considered good business. On the other hand, selling information, while it is never free, does not seem to go over well in democratic societies. Open access journal literature, just like open source software, should make it easier to improve the human condition. After all, aren't both things intended to expand our knowledge and improve our lives?
Short reviews of selected pieces of open source software
There are many pieces of open source software directly relevant to the ongoing work of libraries. The first few listed here are general-purpose applications. The set that follows is library-specific.
- Apache - Apache is the most popular Web (HTTP) server on the Internet and a standard piece of open source software. Its name doesn't have anything to do with American Indians. Instead, its name comes from the way it is built. It is "a patchy" server, meaning that it is made up of many modular parts creating a coherent whole. This design philosophy has made the application very extensible. Apache is currently at version 2.0, but for some reason many people are still using the 1.3 series. I don't really know why. I have not upgraded my Apache servers to version 2.0 because I do not want to lose the functionality of AxKit, an XML transformation engine. (http://httpd.apache.org/)
- Linux - Linux has come to be the archetypical piece of open source software. It is an operating system just as Windows is an operating system. It is software that handles the most basic operations of a computer, like keyboard input, display via monitors, reading and writing to the hard drive, and getting input/output from network devices. Linux came about when a college student, Linus Torvalds, desired Unix (another operating system) functionality on inexpensive Intel-based hardware. As he wrote his application, he shared it with others, and people added new functionality. Today Linux is one of the most powerful operating systems ever created, and it is definitely getting better all the time. There seems to be little doubt that Microsoft feels at least somewhat threatened by this new kid on the block, especially since it is given away for "free."
- MySQL - MySQL is a relational database application, pure and simple. Billed as "The World's Most Popular Open Source Database," MySQL certainly has wide support in the Internet community. MySQL runs on just about any computing platform, and it has been used to manage millions of records and gigabytes of data. Fast and robust, it supports the majority of people's relational database needs. If there were one technical skill I could teach the library profession, it would be the creation and maintenance of relational databases, and I would teach them how to use MySQL. (http://www.mysql.com/)
- Perl - Perl is a programming language. Originally written to handle various systems administration tasks, Perl's strength lies in its ability to manipulate text. Perl matured through the era of Gopher but really started becoming popular with the advent of CGI scripting. Perl has been ported to just about every computer operating system, has one of the largest numbers of support forums, and has been written about in more books than you can count. Perl is mature and very robust. Other very good programming languages exist and can do much of what Perl can do. Examples include other "P" languages such as PHP and Python. These languages are becoming increasingly popular, especially PHP, but at the risk of starting a religious war, I advocate Perl because of its very large support base and its cross-platform functionality. (http://www.perl.com/)
- swish-e - Swish-e is an uncomplicated indexer/search engine. Swish-e indexes individual files on a file system, files retrieved by crawling a website, or a stream of content from another application such as a database. The indexing half of swish-e is able to index specifically marked-up text in XML and HTML as fields for searching later. The same application that creates the indexes can be used to search them. Swish-e supports relevance ranking, Boolean operations, right-hand truncation, field searching, phrase searching, freetext searching, and nested queries. Its inherently open nature allows for the creation of some very smart search engines supporting things like spelling correction, thesaurus intervention, and "best bets" implementations. Of all the different types of information services librarians provide, access to indexes is one of the biggest. With swish-e librarians could create their own indexes and rely on commercial bibliographic indexers less and less. (http://www.swish-e.org/)
- xsltproc and xmllint - Xsltproc and its companion program, xmllint, are very useful applications for parsing and processing XML files. By feeding xsltproc an XSL style sheet and an XML data file, you can transform the XML data file into any one of a number of text files, whether they be SQL, (X)HTML, tab-delimited files, or even plain text files intended for printing. Xmllint is a syntax checker: given an XML file, xmllint will check its validity. With xsltproc and a plain ol' text editor, you can learn a whole lot about XML. (http://xmlsoft.org/XSLT/)
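To make the xsltproc idea concrete, here is a minimal sketch of the transform step: XML in, HTML out. It is written with Python's standard library rather than XSLT -- so it approximates what xsltproc does procedurally instead of with a style sheet -- and the record format is made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML record. With xsltproc you would express the
# transformation declaratively in an XSL style sheet instead.
xml_data = """
<record>
  <title>Open Source Software in Libraries</title>
  <creator>Eric Lease Morgan</creator>
</record>
"""

# Parse the XML (what xmllint would also validate)...
record = ET.fromstring(xml_data)

# ...and transform it into (X)HTML.
html = "<html><body><h1>{}</h1><p>{}</p></body></html>".format(
    record.findtext("title"), record.findtext("creator")
)
print(html)
```

The same parsed record could just as easily be emitted as SQL or a tab-delimited line, which is exactly the flexibility the xsltproc/XSL combination offers.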
The following is a list of more library-specific open source software distributions.
- DSpace - DSpace is a tool designed to allow institutions, such as libraries, to collect, archive, index, and disseminate the scholarly and intellectual efforts of a community. Written by MIT with a combination of technologies, it is primarily used to capture bibliographic information describing articles, papers, theses, and dissertations. Once content is entered into the system, DSpace indexes it and provides a way to link to the originals. DSpace plays well with open standards such as XML and OAI-PMH. If a large number of institutions of higher education were to capture their intellectual output using DSpace or some other similar piece of software, then access to scholarly materials would be greatly increased and readily available. (http://www.dspace.org/)
- Greenstone - Greenstone is a tool for creating and managing digital library collections. Running on Windows as well as various flavors of Unix, it provides the means to easily create searchable and browsable interfaces to digital library collections via the Web. It also enables implementors to save their collections to CDs. Thus digital library collections can be distributed to people with poor or no Internet access. Greenstone knows how to create collections from "standard" file formats such as HTML files, email messages, PDF documents, JPEG and GIF images, and Word documents, as well as plain text files. If the sets of files are well structured, then Greenstone will create things like A-Z lists of resources and field-searchable interfaces. Greenstone's look and feel can be customized through an HTML-like template language. The mailing lists provide more than adequate support, and the documentation is thorough. (http://www.greenstone.org/)
- Koha - Koha is an integrated library system with a growing user community. Written in Perl and using MySQL as the underlying database, Koha makes it simple to create and manage a small to medium-sized integrated library system. Equipped with acquisitions, cataloging, circulation, and searching modules, it provides much of the functionality of traditional online catalogs. With the recent implementation of its Z39.50 interface, it is easy to enter ISBNs into the system, locate MARC records, and have those records added. The user and system interfaces are simple and unencumbered but, alas, not very customizable. For many libraries, the catalog is the centerpiece of the operation. Koha represents a major step in providing a catalog that is functional and usable. As long as support continues, I expect Koha to become a more viable option for larger library collections. The obstacle is not technology. The obstacle is time and effort. (http://www.koha.org/)
- MARC::Record - This Perl module is the tool to use when reading and writing MARC records. It is very well supported on the Perl4Lib mailing list, and a testament to the module's abilities is its incorporation into things like Koha and Net::Z3950. If you are not familiar with object-oriented programming techniques, then MARC::Record might take a bit of getting used to. On the other hand, learning to use MARC::Record will not only improve your programming abilities but also educate you on the intricacies of the MARC record data structure, a structure that was designed in an era of scarce disk space, non-relational databases, and little or no network connectivity. (http://marcpm.sourceforge.net/)
- MyLibrary - MyLibrary is a user-driven, customizable interface to sets of library resources -- a portal. Technically, MyLibrary is a database-driven website application written in Perl. It requires a relational database application as a foundation, and it currently supports MySQL and PostgreSQL. MyLibrary grew out of a number of focus group interviews where people said they were suffering from information overload. To address this problem, MyLibrary takes three essential components of librarianship (resources, patrons, and librarians) and tries to create relationships between them through the use of common controlled vocabularies. Like a library catalog, MyLibrary provides the means to create collections of resources and classify them. Unlike a library catalog, the system also allows librarians as well as patrons to be classified in the same manner. By sharing a common set of controlled vocabulary terms, relationships between resources, patrons, and librarians can be made, thus facilitating things like, "If you are like this, then these resources may be of interest", or "If you have this interest, then your librarian is...", or "These people have expressed an interest in this, therefore your patrons are...", or potentially even doing Amazon-like things such as "People like you also used...". (http://dewey.library.nd.edu/mylibrary/)
- YAZ Toolkit - YAZ is a C library and resulting binary application implementing a Z39.50/SRW client. Zebra is an indexer and Z39.50 server. The yaz-client is a straightforward terminal application. Zebraidx is the indexer, and it requires bunches o' configuration files. It is not as straightforward as other indexers, but its data can be served by zebrasrv. Since the client is built on a library, it can be (and has been) compiled into other tools such as PHP and Apache. The YAZ API also has a Perl interface. YAZ and Zebra are definitely worth your time exploring if you want to make your collections available through Z39.50. Yes, you will spend time learning the ins and outs of Z39.50, but that experience can be carried forward and applied in other venues where Z39.50 is needed. (http://www.indexdata.dk/yaz/ and http://www.indexdata.dk/zebra/)
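As an aside to the MARC::Record entry above, the "intricacies of the MARC record data structure" can be illustrated in a few lines. This Python sketch hand-builds, and then parses, a deliberately minimal (not fully standards-complete) MARC-like record: a 24-byte leader, a directory of 12-byte entries, and the field data, delimited by field and record terminators.

```python
# MARC delimiters: field terminator (\x1e) and record terminator (\x1d).
FT, RT = "\x1e", "\x1d"

# Hand-build a record with a single variable field (tag 245, "Test title").
data = "Test title" + FT                        # field data, 11 bytes with terminator
directory = "245" + "0011" + "00000" + FT       # tag, field length, start offset
leader = "00049" + " " * 7 + "00037" + " " * 7  # record length, base address of data

record = leader + directory + data + RT

# Parse it back out, the way a MARC library must.
base = int(record[12:17])          # leader positions 12-16: base address of data
tag = record[24:27]                # first directory entry: tag...
length = int(record[27:31])        # ...field length (including terminator)...
start = int(record[31:36])         # ...and starting offset relative to the base
field = record[base + start : base + start + length - 1]  # drop the terminator

print(tag, field)  # 245 Test title
```

Fixed-width offsets, five-digit lengths, and one-byte terminators are exactly the disk-space-saving design decisions of the 1960s that MARC::Record hides from you.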
Open source software and libraries
The principles and practices of open source software are very similar to the principles and practices of modern librarianship. Both value free and equal access to data, information, and knowledge. Both value the peer review process. Both advocate open standards. Both strive to promote human understanding and to make our lives better. Both make efforts to improve society as a whole, assuming the whole is greater than the sum of its parts.
The use of open source software in libraries enables libraries to have greater control over their computing environments. Nobody is saying that all librarians should know how to compile relational database programs or debug Perl programs. On the other hand, it behooves libraries, as institutions, to know how to do this. If librarians want to be leaders in the fields of information and knowledge, then libraries need to know how to exploit the current technology that makes this happen. Open source principles, practices, and results can assist librarians in their fulfillment of day-to-day tasks as well as the goals of the profession.
Open source software represents a way for librarians to retain control over their computing environments instead of having their computer environments control them.
About the author
Eric Lease Morgan <email@example.com> is a librarian at the University of Notre Dame, Notre Dame, Indiana, United States. His primary job is to help the Libraries implement and facilitate digital library services and digital library collections. He considers himself to be a librarian first and a computer user second. His professional goal is to discover new ways to improve library service through the use of computer technology. He was the original developer of MyLibrary and has been giving his software away for more than twenty years. He is also the maintainer of the Alex Catalogue of Electronic Texts at infomotions.com. In his spare time he can be seen folding defective floppy disks into intricate flora and fauna.
Creator: Eric Lease Morgan <firstname.lastname@example.org>
Source: This is the pre-edited, English language version of the French article entitled "Logiciels libres et bibliothèques", BiblioAcid 1(2-3), May-June 2004, pp. 1-8.
Date created: 2004-05-04
Date updated: 2004-12-12
Subject(s): articles; open source software; librarianship;