In the field of librarianship, I think the preservationists have the most challenging job because it is fraught with the greatest number of unknowns.
Twenty-eight (28) CDs
As I am writing this posting, I am in the middle of an annual process: archiving the data I created during the previous year. This is something I have been doing since 1986. It began by putting my writings on 3.5-inch “floppy” disks. After a few years, CDs became more feasible, and I have been using them ever since. The first few CDs contain multiple years’ worth of content. This year I will require 14 CDs, and considering the fact that I create duplicates of every CD, this year I will burn 28. It goes without saying, this process takes a long time.
Now, I’m not quite as prolific a writer as 28 CDs sound, but the type of content I archive is large and diverse. It begins with my email, which I have been systematically collecting since 1997. (“Can you say, ‘Mr. Serials’?”) No, I do not have all of my email, just the email I think is important; email of a significant nature where I actually say something, or somebody actually says something to me. It includes some attachments in the form of PDF documents and image files. It includes inquiries I get regarding my work and postings to mailing lists that are longer rather than shorter. By the way, I only send plain text email messages because MIME encodings (the process used to include other than plain text content) add an extra layer of complexity when it comes to reading and parsing email (mbox) archives. How can I be sure future digital archeologists will be able to compute against such stuff? Likewise, nothing gets tape archived (“tarred”), and nothing gets compressed (“zipped”), all for the same reason: an extra layer of complexity. Since I am the “owner” of the Code4Lib, NGC4Lib, and Usability4Lib mailing lists, and since I used to be the official archivist for ACQNET, I systematically collect, organize, archive, index, and provide access to these mailing lists using Mr. Serials. Burning the raw (mbox) email files of these lists as well as their browsable HTML counterparts is a part of my annual email preservation process.
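To make the point about MIME and mbox files a little more concrete, here is a minimal sketch in Python. It is not part of my actual process and it is not how Mr. Serials works. It only assumes an mbox file exists at some path you supply, and the function name summarize_mbox is made up for illustration. It uses the mailbox module from the Python standard library to count how many messages are plain and how many are MIME multipart, which is one rough measure of how much extra unpacking a future reader would have to do.

```python
import mailbox


def summarize_mbox(path):
    """Count plain versus MIME multipart messages in an mbox file.

    `path` is a hypothetical location of an mbox archive; the function
    name and the returned dictionary layout are illustrative only.
    """
    counts = {"plain": 0, "multipart": 0}
    # create=False: raise an error instead of silently creating an empty file
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            if message.is_multipart():
                # attachments or alternative parts: an extra layer to unpack
                counts["multipart"] += 1
            else:
                # a single body part, readable with little more than a text editor
                counts["plain"] += 1
    finally:
        box.close()
    return counts
```

Calling summarize_mbox on an archive returns a small dictionary of the two counts. The fewer multipart messages there are, the closer the archive is to something that can be read with nothing but a text editor.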
The process continues with the various types of other writings. Each presentation I give has its own folder complete with invitation, logistics, bio & abstract, as well as three versions of my presentation: 1) a plain-text version, 2) a one-page handout in the form of a PDF file, and 3) a Word document. (Ick!) If I’m lucky I will remember to archive the TEI version of my remarks, which is always longer than one page and lives in the Musings section of Infomotions. Other types of writings include the plain text versions of blog postings, various versions of essays for publication, etc. At the very least, everything is saved as plain text. Not Word. Not PDF. Not anything that is platform or software-title specific. Otherwise I can’t guarantee it will be readable into the next decade. I figure that if someone can’t read a plain text file, then they have much bigger problems.
Then there is the software. I write lots of software over the period of one year. At least a couple dozen programs. Some of them are simple hacks. Some of them are “studies”, experiments, or investigations. Some of them are extensive intermediaries between relational databases and people using Web browsers. While many of these programs come to me in bursts of creative energy, I would not have the ability to recreate them if they were lost and gone to Big Byte Heaven. When it comes to computers, your data is your most important asset. Not the hardware. Not the software. The data. The content you create. This is the content you cannot get back again. This is the content that is unique. This is the content that needs to be backed up and saved against future calamity.
Because some of my data is saved in relational databases, the annual preservation process includes raw database dumps. Again, these are plain text files but in the form of SQL statements. Thank God for mysqldump. It gives me the opportunity to restore my Musings, my blog, my Alex Catalogue, my water collection, and now my Highlights & Annotations. (More on that later.)
All of the content above fits on a single CD. Easily. Again, I’m not that prolific of a writer.
The hard part is the multimedia. As a part of an Apple Library of Tomorrow grant awarded to me by Steve Cisler, I was given an Apple QuickTake camera in 1994 or so. It could store about 24 pictures in 256 colors. It broke when my wife accidentally dropped it into a pond. It still works, if you have the necessary Macintosh hardware and it is plugged in. Presently, I use a 5 megapixel camera. I take the pictures at the highest resolution. I take movies as well. The pictures get edited. The movies get edited as well. This content currently makes up the bulk of the CDs. Six for the movies saved in the Apple movie (.mov) format. One DVD for actual use. Three for the full-scale JPEG images. Three for the iPhoto CDs. While I feel confident the JPEG files will be readable into the future, I’m not so sure about the .mov files, let alone the DVD. I might feel better about some sort of MPEG format, but it seems to be continually changing. Similarly, I suppose I ought to be saving the JPEG files as PNG files. At least that way more of the metadata may be traveling along with the images. For even better preservation, I ought to be putting the movies on video tape. (There is no compression or encryption there). I ought to be printing the photographs on glossy paper and binding the whole lot into books.
This year I started saving my music. I’ve been recording myself playing guitar since 1984. It began with audio cassette tapes. I have about 30 of them labeled and stored away in plastic boxes. I’ve made a couple of attempts to digitize them, but the process is very laborious. It is easier to record yourself digitally in the first place and save the resulting files. This year I rooted through my archives and found a number of recordings. Tests of new recording gear and software. Experiments in production techniques. Background music to home videos. Saved as AIFF files, I hope they will be readable in the future.
Once everything gets burnt to CDs, one copy becomes my working copy. The other copy goes to a CD case not to be touched. Soon I will need a new case.
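Keeping two copies only helps if the two copies really are the same. As a hedged illustration, and not something I claim to do every year, the following Python sketch compares a working copy against an archival copy by computing a checksum for every file. The directory paths, the function names, and the choice of SHA-256 are all assumptions made for the sake of the example.

```python
import hashlib
from pathlib import Path


def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(working, archival):
    """Report files that are missing from, or differ between, two copies.

    `working` and `archival` are hypothetical directories, for example
    the mount points of two burned discs.
    """
    a, b = manifest(working), manifest(archival)
    return {
        "missing_from_archival": sorted(set(a) - set(b)),
        "missing_from_working": sorted(set(b) - set(a)),
        "different": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

If all three lists in the result come back empty, the two copies contain the same files with the same bytes. Saving the output of manifest as a plain text file alongside the content would also give a future reader something to check the disc against.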
Finally, not everything is digital. In fact, I print a lot. Print that thought-provoking email message. Print that essay. Print this blog posting. Print the code to that computer program. Sign and date the printout. Put it into the archival box. The number of boxes I’m accumulating is now up to about 10.
What can I say? I enjoy all aspects of librarianship.
My world of (digital) preservation is minuscule compared to the work of academic preservationists, archivists, and curators. If it takes this much effort to systematically collect, organize, and archive one person’s content, then think how much effort would be required to apply the process against the intellectual output of an entire college or university!
U of MN Archive
Even if so much people-power were available, this is no insurance against the future. How do we go about preserving digital content? What formats should the content be manifested in? What hardware will be needed to read the media where the data is saved? What software will be necessary to read the data? Too many questions. Too many unknowns. Too many things that are unpredictable. Right now, there seem to be only two solutions, and the real solution is probably a combination of the two. First, make sincere efforts to copy non-proprietary formats of content to physical media, a storage artifact that can be read by the widest variety of computer hardware. Plan on migrating the content as well as the physical media forward as technology changes. Think of this process as a type of insurance. Second, make as many copies of the content as possible in as many formats as possible. Print it. Microfilm it. Put it on tape and spinning disks. Make it available on the Web. While the folks at LOCKSS may not have thought the expression would be used in this manner, it is still true: “Lots of copies keep stuff safe.”
I sincerely believe we are in the process of creating a Digital Dark Age. “No, you can not read or access that content. It was created during the late 20th and early 21st centuries. It was a time of prolific exploration, few standards, and many legal barriers.” Something needs to happen differently.
Maybe it doesn’t really matter. Maybe the content that is needed is the content that always lives on “spinning disks” and gets automatically migrated forward. Computers make it easier to create lots of junk. It certainly doesn’t all need to be preserved. On the other hand, those letters from the American Civil War were not necessarily considered important at the time. Many of them were written by unknown people. Yet, these letters are important to us today. Not because of who wrote them, but because they reflect the thinking of the time. They provide pieces of a puzzle that can verify facts or provide alternative perspectives. After years and years, information can grow in importance, and consequently, today, we run the risk of throwing away stuff that is of importance tomorrow.
Preservationists have the hardest job in the field of librarianship. More power to them.