Archive for May, 2010

Cyberinfrastructure Days at the University of Notre Dame

Sunday, May 23rd, 2010

On Thursday and Friday, April 29 and 30, 2010, I attended a Cyberinfrastructure Days event at the University of Notre Dame. Through this process my personal definition of “cyberinfrastructure” was updated, and my basic understanding of “digital humanities computing” was confirmed. This posting documents the experience.

Day #1 – Thursday, April 29

The first day was devoted to cyberinfrastructure and the humanities.

After all of the necessary introductory remarks, John Unsworth (University of Illinois at Urbana-Champaign) gave the opening keynote presentation entitled “Reading at library scale: New methods, attention, prosthetics, evidence, and argument”. In his talk he posited the impossibility of reading everything currently available. There is just too much content. Given some of the computing techniques at our disposal, he advocated additional ways to “read” material, but cautioned the audience in three ways: 1) there needs to be attention to prosthetics, 2) an appreciation for evidence and statistical significance, and 3) a sense of argument so the skeptic is able to test the method. To me this sounded a whole lot like applying scientific methods to the process of literary criticism. Unsworth briefly described MONK and elaborated on how part-of-speech tagging had been done against the corpus. He also described how Dunning’s log-likelihood statistic can be applied to texts in order to determine what a person does (and doesn’t) include in their writings.
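
For what it is worth, the statistic itself is not difficult to compute. Below is a minimal Perl sketch of how Dunning’s log-likelihood (G2) might be calculated for a single word appearing in two corpora. The word counts and corpus sizes are invented, and MONK’s real implementation is surely more sophisticated.

  # dunning.pl - a toy implementation of Dunning's log-likelihood (G2)
  # statistic; it compares how often a word appears in one corpus versus
  # another, and the numbers below are made up for illustration only
  use strict;
  use warnings;
  
  sub log_likelihood {
  
      # word counts and corpus sizes
      my ( $count1, $count2, $size1, $size2 ) = @_;
  
      # expected counts, assuming the word were evenly distributed
      my $expected1 = $size1 * ( $count1 + $count2 ) / ( $size1 + $size2 );
      my $expected2 = $size2 * ( $count1 + $count2 ) / ( $size1 + $size2 );
  
      # sum the observed-versus-expected terms; zero counts contribute nothing
      my $g2 = 0;
      $g2 += $count1 * log( $count1 / $expected1 ) if $count1;
      $g2 += $count2 * log( $count2 / $expected2 ) if $count2;
      return 2 * $g2;
  
  }
  
  # "whale" appears 1,200 times in a 250,000-word corpus but only 85 times
  # in a 300,000-word reference corpus; the larger the score, the more
  # distinctive the word is of the first corpus
  printf "G2 = %.2f\n", log_likelihood( 1200, 85, 250_000, 300_000 );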

Stéfan Sinclair (McMaster University) followed with “Challenges and opportunities of Web-based analytic tools for the humanities”. He gave a brief history of the digital humanities in terms of computing. Mainframes and concordances. Personal computers and even more concordances. Webbed interfaces and locally hosted texts. He described digital humanities as something that has evolved in cycles since at least 1967. He argued that the new tools will be Web apps — things that can be embedded into Web pages and used against just about any text. His Voyeur Tools were an example. Like Unsworth, he advocated the use of digital humanities computing techniques because they can supplement the analysis of texts. “These tools allow you to see things that are not evident.” Sinclair will be presenting a tutorial at the annual digital humanities conference this July. I hope to attend.

In a bit of a change of pace, Russ Hobby (Internet2) elaborated on the nuts & bolts of cyberinfrastructure in “Cyberinfrastructure components and use”. In this presentation I learned that many scientists are interested in the… science, and they don’t really care about the technology supporting it. They have an instrument in the field. It is collecting and generating data. They want to analyze that data. They are not so interested in how it gets transported from one place to another, how it is stored, or in what format. As I knew, they are interested in looking for patterns in the data in order to describe and predict events in the natural world. “Cyberinfrastructure is like a car. ‘Car, take me there.’” Cyberinfrastructure is about controls, security systems, storage sets, computation, visualization, support & training, collaboration tools, publishing, communication, finding, networking, etc. “We are not there to answer the questions, but more to ask them.”

In the afternoon I listened to Richard Whaling (University of Chicago) present on “Humanities computing at scale”. Given from the point of view of a computer scientist, this presentation was akin to Hobby’s. On the one hand there are people who do analysis, and on the other there are people who create the analysis tools. Whaling is more like the latter. I thought his discussion of the format of texts was most interesting. “XML is good for various types of rendering, but not necessarily so good for analysis. XML does not necessarily go deep enough with the encoding because the encoding is too expensive; XML is not scalable. Nor is SQL. Indexing is the way to go.” This perspective jibes with my own experience. Encoding texts in XML (TEI) is so very tedious, and the tools to do any analysis against the result are few and far between. Creating the perfect relational database (SQL) is like seeking the Holy Grail, and SQL is not designed to do full-text searching nor “relevancy ranking”. Indexing texts and doing retrieval against the result has proven to be much more fruitful for me, but such an approach is an example of “bag of words” computing, and thus words (concepts) often get placed out of context. Despite that, I think the indexing approach holds the most promise. Check out Perseus under PhiloLogic and the Digital South Asia Library to see some of Whaling’s handiwork.
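
To make the indexing idea a bit more concrete, here is a toy “bag of words” sketch in Perl. It builds an inverted index from a few invented documents and then answers a one-word query. A production indexer does enormously more (stemming, ranking, phrase searching, etc.), but the underlying idea is the same, and the example also illustrates how words lose their original context along the way.

  # index.pl - a toy "bag of words" inverted index: map each word to the
  # documents containing it, and then do a trivial retrieval; the sample
  # documents are invented and no ranking of any kind is attempted
  use strict;
  use warnings;
  
  my %documents = (
      'moby-dick' => 'Call me Ishmael Some years ago never mind how long',
      'walden'    => 'I went to the woods because I wished to live deliberately',
      'leaves'    => 'I celebrate myself and sing myself',
  );
  
  # build the index; each word points to the documents that contain it
  my %index;
  while ( my ( $id, $text ) = each %documents ) {
      foreach my $word ( grep { $_ } split /\W+/, lc $text ) {
          $index{ $word }{ $id }++;
      }
  }
  
  # "search" for a single word
  my $query = 'myself';
  if ( exists $index{ $query } ) {
      foreach my $id ( sort keys %{ $index{ $query } } ) {
          my $count = $index{ $query }{ $id };
          print "'$query' found in $id ($count time(s))\n";
      }
  }
  else { print "no documents contain '$query'\n" }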

Chris Clarke (University of Notre Dame), in “Technology horizons for teaching and learning”, enumerated ways the University of Notre Dame is putting into practice many of the things described in the most recent Horizon Report. Examples included the use of ebooks, augmented reality, gesture-based computing, and visual data analysis. I thought the presentation was a great way to bring the forward-thinking report down to Earth and place it into a local context. Very nice.

William Donaruma (also from the University of Notre Dame) described the process he was going through to create 3-D movies in a presentation called “Choreography in a virtual space”. Multiple — very expensive — cameras. Dry ice. Specific positioning of the dancers. Special glasses. All of these things played into the creation of an illusion of three dimensions on a two-dimensional surface. I will not call it three-dimensional until I can walk around the object in question. The definition of three-dimensional needs to be qualified.

The final presentation of the day took place after dinner. The talk, “The Transformation of modern science”, was given virtually by Edward Seidel (National Science Foundation). Articulate. Systematic. Thorough. Insightful. These are the sorts of words I use to describe Seidel’s talk. Presenting remotely through a desktop camera, his image displayed on a screen to the audience, Seidel gave us a history of science and a description of how it has changed from single-man operations to large-group collaborations. He compared the volume of information created previously to the volume of information generated now. All of this led up to the most salient message — “All future National Science Foundation grant proposals must include a data curation plan.” Seidel mentioned libraries, librarians, and librarianship quite a number of times during the talk. Naturally my ears perked up. My profession is about the collection, preservation, organization, and dissemination of data, information, and knowledge. The type of content to which these processes are applied — books, journal articles, multi-media recordings, etc. — is irrelevant. Given a collection policy, it can all be important. The data generated by scientists and their machines is no exception. Is our profession up to the challenge, or are we too wedded to printed, bibliographic materials? It is time for librarians to aggressively step up to the plate, or else. Here is an opportunity being laid at our feet. Let’s pick it up!

Day #2 – Friday, April 30

The second day centered more around the sciences as opposed to the humanities.

The day began with a presentation by Tony Hey (Microsoft Research) called “The Fourth Paradigm: Data-intensive scientific discovery”. Hey described cyberinfrastructure as the new name for e-science. He then echoed much of the content of Seidel’s message from the previous evening and described the evolution of science in a set of paradigms: 1) theoretical, 2) experimental, 3) computational, and 4) data-intensive. He elaborated on the infrastructure components necessary for data-intensive science: 1) acquisition, 2) collaboration & visualization, 3) analysis & mining, 4) dissemination & sharing, 5) archiving & preservation. (Gosh, that sounds a whole lot like my definition of librarianship!) He saw Microsoft’s role as one of providing the necessary tools to facilitate e-science (or cyberinfrastructure) and thus the Fourth Paradigm. Hey’s presentation sounded a lot like open access advocacy. More Association of Research Libraries library directors, as well as university administrators, need to hear what he has to say.

Boleslaw Szymanski (Rensselaer Polytechnic Institute) described how better science could be done in a presentation called “Robust asynchronous optimization for volunteer computing grids”. Like Hobby and Whaling (above), Szymanski separated the work of the scientist from the work of cyberinfrastructure. “Scientists do not want to be bothered with the computer science of their work.” He then went on to describe a distributed computing technique for studying the galaxy — MilkyWay@home. He advocated cloud computing as a form of asynchronous computing.

The third presentation of the day was entitled “Cyberinfrastructure for small and medium laboratories” by Ian Foster (University of Chicago). The heart of this presentation was advocacy for software as a service (SaaS) computing for scientific laboratories.

Ashok Srivastava (NASA) was the first up in the second session with “Using Web 2.0 and collaborative tools at NASA”. He spoke to one of the basic principles of good science when he said, “Reproducibility is a key aspect of science, and with access to the data this reproducibility is possible.” I’m not quite sure my fellow librarians and humanists understand the importance of such a statement. Unlike work in the humanities — which is often built on subjective and intuitive interpretation — good science relies on the ability of many to come to the same conclusion based on the same evidence. Open access data makes such a thing possible. Much of the rest of Srivastava’s presentation was about DASHlink, “a virtual laboratory for scientists and engineers to disseminate results and collaborate on research problems in health management technologies for aeronautics systems.”

“Scientific workflows and bioinformatics applications” by Ewa Deelman (University of Southern California) was up next. She echoed many of the things I heard from library pundits a few years ago when it came to institutional repositories. In short, “Workflows are what are needed in order for e-science to really work… Instead of moving the data to the computation, you have to move the computation to the data.” This is akin to two ideas. First, like Hey’s idea of providing tools to facilitate cyberinfrastructure, Deelman advocates integrating the cyberinfrastructure tools into the work of scientists. Second, e-science is more than mere infrastructure. It also approaches the “services against text” idea which I have been advocating for a few years.

Jeffrey Layton (Dell, Inc.) rounded out the session with a presentation called “I/O pattern characterization of HPC applications”. In it he described how he used the output of strace commands — which can be quite voluminous — to evaluate storage input/output patterns. “Storage is cheap, but it is only one of a bigger set of problems in the system.”
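
For the fun of it, here is a toy Perl sketch of the sort of thing I imagine such I/O characterization begins with: tally the read() and write() calls in an strace log and total the bytes each moved. The file name (trace.log) is hypothetical, this is not Layton’s tooling, and real traces need far more careful parsing than this.

  # strace-io.pl - a toy sketch of I/O pattern characterization: count the
  # read() and write() calls in an strace log and sum the bytes each moved;
  # "trace.log" is a made-up file name, and real traces are messier than
  # the simple-minded regular expression below assumes
  use strict;
  use warnings;
  
  my %calls;    # count and total bytes, keyed on syscall name
  
  open my $trace, '<', 'trace.log' or die "Can't open trace.log: $!";
  while ( my $line = <$trace> ) {
  
      # lines of interest look something like: read(3, "..."..., 65536) = 4096
      next unless $line =~ /^(read|write)\(\d+,.*\)\s*=\s*(\d+)/;
      $calls{ $1 }{ count } += 1;
      $calls{ $1 }{ bytes } += $2;
  }
  close $trace;
  
  # report totals and mean request sizes
  foreach my $syscall ( sort keys %calls ) {
      printf "%-5s  calls: %8d  bytes: %12d  mean: %8.1f\n",
          $syscall,
          $calls{ $syscall }{ count },
          $calls{ $syscall }{ bytes },
          $calls{ $syscall }{ bytes } / $calls{ $syscall }{ count };
  }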

By this time I was full, my iPad had arrived in the mail, and I went home.

Observations

It just so happens I was given the responsibility of inviting a number of the humanists to the event, specifically: John Unsworth, Stéfan Sinclair, and Richard Whaling. That was an honor, and I appreciate the opportunity. “Thank you.”

I learned a number of things, and a few other things were reinforced. First, the word “cyberinfrastructure” is the newly minted term for “e-science”. Many of the presenters used these two words interchangeably. Second, while my experience with the digital humanities is still in its infancy, I am definitely on the right track. Concordances certainly don’t seem to be going out of style any time soon, and my use of indexes is a movement in the right direction. Third, the cyberinfrastructure people see themselves as support for the work of scientists. This is similar to the work of librarians, who see themselves as supporting their larger communities. Personally, I think this needs to be qualified since I believe it is possible for me to expand the venerable sphere of knowledge too. Providing library (or cyberinfrastructure) services does not preclude me from advancing our understanding of the human condition and/or describing the natural world. Lastly, open source software and open access publishing were common underlying themes but rarely explicitly stated. I wonder whether or not the idea of “open” is a four-letter word.

About Infomotions Image Gallery: Flickr as cloud computing

Saturday, May 22nd, 2010

This posting describes the whys and wherefores behind the Infomotions Image Gallery.

Photography

I was introduced to photography during library school, specifically, when I took a multi-media class. We were given film and movie cameras, told to use the equipment, and expected to learn about the medium through the process. I took many pictures of very tall smoke stacks and classical-looking buildings. I also made a stop-motion movie where, step by step, I folded an origami octopus and an underwater sea diver while a computer played the Beatles’ “Octopus’s Garden” in the background. I’d love to resurrect that 16mm film.

I was introduced to digital photography around 1995 when Steve Cisler (Apple Computer) gave me a QuickTake camera as part of a payment for writing a book about Macintosh-based HTTP servers. That camera was pretty fun. If I remember correctly, it took 8-bit images and could store about twenty-four of them at a time. The equipment worked perfectly until my wife accidentally dropped it into a pond. I still have the camera, somewhere, but it only works if it is plugged into an electrical socket. Since then I’ve owned a few other digital cameras and one or two digital movie cameras. They have all been more than simple point-and-shoot devices, but at the same time, they have always had more features than I’ve ever really exploited.

Over the years I mostly used the cameras to document the places I’ve visited. I continue to photograph buildings. I like to take macro shots of flowers. Venuses are always appealing. Pictures of food are interesting. In the self-portraits one is expected to notice the background, not necessarily the subject of the image. I believe I’m pretty good at composition. When it comes to color I’m only inspired when the sun is shining bright, and that makes some of my shots overexposed. I’ve never been very good at photographing people. I guess that is why I prefer to take pictures of statues. All things library and books are a good time. I wish I could take better advantage of focal lengths in order to blur the background but maintain a sharp focus in the foreground. The tool requires practice. I don’t like to doctor the photographs with effects. I don’t believe the result represents reality. Finally, I often ask myself an aesthetic question, “If I was looking through the camera to take the picture, then did I really see what was on the other side?” After all, my perception was filtered through an external piece of equipment. I guess I could ask the same question of all my perceptions since I always wear glasses.

The Infomotions Image Gallery is simply a collection of my photography, sans personal family photos. It is just another example of how I am trying to apply the principles of librarianship to the content I create. Photographs are taken. Individual items are selected, and the collection is curated. Given the available resources, metadata is applied to each item, and the whole is organized into sets. Every year the newly created images are archived to multiple mediums for preservation purposes. (I really ought to make an effort to print more of the images.) Finally, an interface is implemented allowing people to access the collection.

Enjoy.

[Gallery images: orange hot stained glass, Tilburg University sculpture, coastal home, beach sculpture, metal book, thistle, DSCN5242, Three Sisters]

Flickr as cloud computing

This section describes how the Gallery is currently implemented.

About ten years ago I began to truly manage my photo collection using Apple’s iPhoto. At just about the same time I purchased an iPhoto add-on called BetterHTMLExport. Using a macro language, this add-on enabled me to export sets of images to index and detail pages complete with titles, dates, and basic numeric metadata such as exposure, f-stop, etc. The process worked, but the software grew long in the tooth, was sold to another company, and was always a bit cumbersome. Moreover, maintaining the metadata was tedious, inhibiting my desire to keep it up to date. Too much editing here, exporting there, and uploading to yet a third place. To make matters worse, people expect to comment on the photos, put them into their own sets, and watch some sort of slide show. Enter Flickr and a jQuery plug-in called ColorBox.

After learning how to use iPhoto’s ability to publish content to Flickr, and after taking a closer look at Flickr’s application programmer interface (API), I decided to use Flickr to host my images. The idea was to: 1) maintain the content on my local file system, 2) upload the images and metadata to Flickr, and 3) programmatically create an interface to the content on my website. The result was a more streamlined process and a set of Perl scripts implementing a cleaner user interface. I was entering the realm of cloud computing. The workflow is described below:

  1. Take photographs – This process is outlined in the previous section.
  2. Import photographs – Import everything, but weed right away. I’m pretty brutal in this regard. I don’t keep duplicate or very similar shots. No (or very, very few) out-of-focus or poorly composed shots are kept either.
  3. Add titles – Each photo gets some sort of title. Sometimes they are descriptive. Sometimes they are rather generic. After all, how many titles can different pictures of roses have? If I were really thorough I would give narrative descriptions to each photo.
  4. Make sets – Group the imported photos into a set and then give a title to the set. Again, I ought to add narrative descriptions, but I don’t. Too lazy.
  5. Add tags – Using iPhoto’s keywords functionality, I make an effort to “tag” each photograph. Tags are rather generic: flower, venus, church, me, food, etc.
  6. Publish to Flickr – I then use iPhoto’s sharing feature to upload each newly created set to Flickr. This works very well and saves me the time and hassle of converting images. This same functionality works in reverse. If I use Flickr’s online editing functions, changes are reflected on my local file system after a refresh process is done. Very nice.
  7. Re-publish to Infomotions – Using a system of Perl scripts I wrote called flickr2gallery I then create sets of browsable pages from the content saved on Flickr.

Using this process I can focus more on my content (the images and their metadata) and less on how everything will be displayed. Graphic design is not necessarily my forte.

Flickr2gallery is a suite of Perl scripts and plain text files:

  1. tags2gallery.pl – Used to create pages of images based on photos’ tags.
  2. sets2gallery.pl – Used to create pages of image sets as well as the image “database”.
  3. make-home.pl – Used to create the Image Gallery home page.
  4. flickr2gallery.sh – A shell script calling each of the three scripts above and thus (re-)building the entire Image Gallery subsite. Currently, the process takes about sixty seconds.
  5. images.db – A tab-delimited list of each photograph’s local home page, title, and Flickr thumbnail.
  6. Images.pm – A really rudimentary Perl module containing a single subroutine used to return a list of HTML img elements filled with links to random images. (A sketch of the idea appears after this list.)
  7. random-images.pl – Designed to be used as a server-side include, calls Images.pm to display sets of random images from images.db.
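
For the curious, the idea behind Images.pm can be illustrated with a sketch along the following lines. The file location, the subroutine name, and most of the details are illustrative rather than the real thing; the point is simply to read the tab-delimited images.db and return a handful of randomly chosen img elements wrapped in links.

  package Images;
  
  # a toy sketch of the idea behind Images.pm: read the tab-delimited
  # images.db (local page, title, Flickr thumbnail) and return a list of
  # HTML img elements pointing to randomly selected images; the file
  # location and subroutine name here are illustrative, not the real ones
  use strict;
  use warnings;
  
  use constant DB => '/path/to/images.db';    # hypothetical location
  
  sub random_images {
  
      my $count = shift || 3;
  
      # slurp the database into a list of [ page, title, thumbnail ] records
      open my $fh, '<', DB or die "Can't open ", DB, ": $!";
      my @records = map { chomp; [ split /\t/ ] } <$fh>;
      close $fh;
  
      # pick $count records at random and build the links
      my @images;
      foreach ( 1 .. $count ) {
          my ( $page, $title, $thumb ) = @{ $records[ int rand @records ] };
          push @images,
              qq(<a href="$page"><img src="$thumb" alt="$title" border="0" /></a>);
      }
      return @images;
  
  }
  
  1;

A server-side include then only has to print whatever the subroutine returns.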

I know the Flickr API has been around for quite a while, and I know I’m a Johnny-come-lately when it comes to learning how to use it, but that does not mean it can’t be outlined here. The API provides a whole lot of functionality: reading and writing of image content and metadata, as well as reading and writing information about users, groups, and places. Using the REST-like interface the programmer constructs a command in the form of a URL. The URL is sent to Flickr via HTTP. Responses are returned in easy-to-read XML.

A good example is the way I create my pages of images for a given tag. First I define a constant which is the root of a Flickr tag search. Next, I define the location of the Infomotions pages on Flickr. Then, after getting a list of all of my tags, I search Flickr for images using each tag as a query. The results are then looped through, parsed, and built into a set of image links. Finally, the links are incorporated into a template and saved to a local file. The heart of the process is listed below:

  # the excerpt assumes these modules; error checking is omitted for brevity
  use CGI;
  use HTTP::Request;
  use LWP::UserAgent;
  use XML::XPath;
  
  use constant S => 'http://api.flickr.com/services/rest/' .
                    '?method=flickr.photos.search' .
                    '&api_key=YOURKEY&user_id=YOURID&tags=';
  use constant F => 'http://www.flickr.com/photos/infomotions/';
  
  my $ua = LWP::UserAgent->new;
  
  # get list of all tags here
  
  # find photos with this tag
  my $request  = HTTP::Request->new( GET => S . $tag );
  my $response = $ua->request( $request );
  
  # process each photo
  my $parser = XML::XPath->new( xml => $response->content );
  my $nodes  = $parser->find( '//photo' );
  my $cgi    = CGI->new;
  my $images = '';
  foreach my $node ( $nodes->get_nodelist ) {
  
    # parse the photo's attributes
    my $id     = $node->getAttribute( 'id' );
    my $title  = $node->getAttribute( 'title' );
    my $farm   = $node->getAttribute( 'farm' );
    my $server = $node->getAttribute( 'server' );
    my $secret = $node->getAttribute( 'secret' );
  
    # build image links
    my $thumb = "http://farm$farm.static.flickr.com/$server/$id" . 
                '_' . $secret . '_s.jpg';
    my $full  = "http://farm$farm.static.flickr.com/$server/$id" . 
                '_' . $secret . '.jpg';
    my $flickr = F . "$id/";
  
    # build list of images
    $images .= $cgi->a({ href  => $full, 
                         rel   => 'slideshow',
                         title => "<a href='$flickr'>Details on Flickr</a>"
                        },
                        $cgi->img({ alt => $title, src => $thumb, 
                                    border => 0, hspace => 1, vspace => 1 }));
  
  }
  
  # save image links to file here
Notice the rel attribute (slideshow) in each of the images’ anchor elements. These attributes are used as selectors in a jQuery plug-in called ColorBox. In the head of each generated HTML file is a call to ColorBox:

  <script type="text/javascript">
    $(document).ready(function(){
      $("a[rel='slideshow']").colorbox({ slideshowAuto: false, 
                                         current: "{current} of {total}",
                                         slideshowStart: 'Slideshow',
                                         slideshowStop: 'Stop',
                                         slideshow: true,
                                         transition:"elastic" });
      });
  </script>

Using this plug-in I am able to implement a simple slideshow when the user clicks on any image. Each slideshow display consists of simple navigation and title. In my case the title is really a link back to Flickr where the user will be able to view more detail about the image, comment, etc.

[Gallery images: barn ceiling, kiln, Hesburgh Library, self-portrait, Giant Eraser, birds, Christian Scientist Church, Redwood Library]

Summary and conclusion

I am an amateur photographer, and the fruits of this hobby are online here for sharing. If you use them, then please give credit where credit is due.

Using Flickr as a “cloud” to host my images has proven very useful. It enables me to mirror my content in more than one location as well as provide access to it in multiple ways. When the Library of Congress announced they were going to put some of their image content on Flickr I was a bit taken aback, but after learning how the Flickr API can be exploited I think there are many opportunities for libraries and other organizations to do the same thing. Using the generic Flickr interface is one way to provide access, but enhanced and customized access can be implemented through the API. Lots of food for thought. Now to apply the same process to my movies by exploiting YouTube.

Shiny new website

Thursday, May 20th, 2010

Infomotions has a shiny new website, and the process to create it was not too difficult.

The problem

A relatively long time ago (in a galaxy far far away), I implemented an Infomotions website look & feel. Tabbed interface across the top. Local navigation down the left-hand side. Content in the middle. Footer along the bottom. Typical. Everything was rather square. And even though I used pretty standard HTML and CSS, the implementation did not play well with Internet Explorer. My bad.

Moreover, people’s expectations have increased dramatically since I first implemented my site’s look & feel. Curved lines. Pop-up windows. Interactive AJAX-like user experiences. My site was definitely not Web 2.0 in nature. Static. Not like a desktop application.

Finally, as time went on my site’s look & feel was not as consistently applied as I had hoped. Things were askew and the whole thing needed refreshing.

The solution

My ultimate solution is rooted in jQuery and its canned themes.

As you may or may not know, jQuery is a well-supported JavaScript library providing all sorts of cool things like drag ‘n drop, sliders, and many animations, not to mention a myriad of ways to manipulate the Document Object Model (DOM) of HTML pages. An extensible framework, jQuery is also the foundation for many plug-in modules.

Just as importantly, jQuery (through jQuery UI) supports a host of themes — CSS files implementing various looks & feels. These themes are very standards compliant and work well in all the major browsers. I was particularly enamored with the tabbed menu with rounded corners. (Under the hood, those rounded corners come from CSS properties such as -webkit-border-radius, supported by the WebKit browser engine. Let’s keep our eye on that one.) After learning how to implement the tabbed interface without the use of JavaScript, I was finally on my way. As Dan Brubakerhorst said to me, “It is nothing but styling.”

None of the Infomotions subsites is driven by hand-coded HTML. Everything comes from some sort of script. The Alex Catalogue is a database-driven website with mod_perl modules. The water collection is supported by a database plus XSLT transformations of XML on the fly. The blog is WordPress. My “musings” are sets of TEI files converted in bulk into HTML. While it took a bit of tweaking in each of these subsites, the process was relatively painless. Insert the necessary divs denoting the menu bar, left-hand navigation, and content into my frameworks. Push the button. Enjoy. If I want to implement a different color scheme or typography, then I simply change a single CSS file for the entire site. In retrospect, the most difficult thing for me to convert was my blog. I had to design my own theme. Not too hard, but definitely a learning curve.
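
As an aside, the “XSLT transformations of XML on the fly” are conceptually as simple as the following Perl sketch, which uses XML::LibXSLT. The file names (water.xml and water.xsl) are invented, and the real water collection code is naturally more involved, but the transformation step itself amounts to only a few lines.

  # transform.pl - a minimal sketch of transforming XML with XSLT on the
  # fly; the file names (water.xml and water.xsl) are hypothetical
  use strict;
  use warnings;
  use XML::LibXML;
  use XML::LibXSLT;
  
  # parse the data and the stylesheet
  my $source     = XML::LibXML->load_xml( location => 'water.xml' );
  my $style      = XML::LibXML->load_xml( location => 'water.xsl' );
  my $stylesheet = XML::LibXSLT->new->parse_stylesheet( $style );
  
  # do the work and print the result
  my $results = $stylesheet->transform( $source );
  print $stylesheet->output_as_bytes( $results );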

A feature I feel pretty strongly about is printing. The Web is one medium. Content on paper is another medium. They are not the same. In general, websites have more of a landscape orientation. Printed mediums more or less have portrait orientations. In the printed medium there is no need for global navigation, local navigation, nor hyperlinks. Silly. Margins need to be accounted for. Pages need to be signed, dated, and branded. Consequently, I wrote a single print-based CSS file governing the entire site. Pages print quite nicely. So nicely I may very well print every single page from my website and bind the whole thing into a book. Call it preservation.

In many ways I consider myself to be an artist, and the processes of librarianship are my mediums. Graphic design is not my forte, but I feel pretty good about my current implementation. Now I need to get back to the collection, organization, preservation, and dissemination of data, information, and knowledge.