SIGIR '99

This text represents the notes I took at SIGIR '99, Berkeley, CA, August 16 - 19, 1999.

Monday, August 16

[photographs: pretty view, Sather Gate, bell tower, Telegraph Avenue]

Hal Varian (University Of California - Berkeley), "The Economics of Search" - The purpose of his talk was to share ideas and "cross-fertilize" the fields of economics and information retrieval (IR). The talk elaborated on three ideas:

  1. The value of information - Here he emphasized the economic value of information, and he posited that information is only valuable when it is new because it helps us make decisions.
  2. Estimating degree of relevance - Relevance is a function of the characteristics of a document and a query. Logistic regression is one method for estimating it, but it only works in a data-rich environment; other options include regression analysis or nonparametric regression. In short, non-linear estimation is useful for prediction. (See the sketch after this list.)
  3. Optimal search behavior - Many times a user only wants to find the optimal result, the single best result. An example is the "Pandora problem" articulated by Martin Weitzman, where people open boxes to get information, optimally allocating time and effort. The goal in such a scenario is to open the box with the greatest value while spending the least time and effort. After illustrating his point with a simple mathematical model, he continued with an example of a person deciding which travel book to buy (Fodor's or Lonely Planet). He summarized these ideas by saying risk and search cost are important factors for determining the optimal search order and the stopping rule.
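
To make point #2 concrete, here is a minimal sketch (in Python, not anything Varian actually showed) of using logistic regression to estimate the probability that a document is relevant given a few document/query features; the feature names and data are entirely made up.

  # A minimal sketch, not Varian's actual model: estimate the probability
  # that a document is relevant to a query from a few document/query
  # features using logistic regression. Features and data are invented.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Each row: [term-overlap score, document length (thousands of words),
  #            number of query terms appearing in the document]
  X = np.array([[0.9, 1.2, 5],
                [0.2, 0.3, 1],
                [0.7, 0.8, 3],
                [0.1, 2.0, 0],
                [0.8, 0.5, 4],
                [0.3, 1.5, 1]])
  y = np.array([1, 0, 1, 0, 1, 0])  # 1 = judged relevant, 0 = not relevant

  model = LogisticRegression().fit(X, y)

  # Estimated degree of relevance for a new document/query pair
  print(model.predict_proba(np.array([[0.6, 0.9, 2]]))[0, 1])

With enough judged document/query pairs (the "data rich environment" he mentioned) the fitted model returns a probability of relevance for unseen pairs.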

Michael Miller (The Catholic University of America), "Visualization of Search Results" - Miller demonstrated text, 2-dimensional, and 3-dimensional models for displaying information, and he did a study using novice and expert computer users to test the effectiveness of these models. The results pointed to the fact that there is not much difference in effectiveness between the text and 2-dimensional models, and much of the time text displays were fastest for answering questions about the documents, but the qualitative results demonstrated that color coding of documents helped users discover related documents.

Ragnar Nordlie (Oslo College), "User Revealment" - Nordlie wanted to discover what system builders can learn from reference librarians, because online search systems are seen as unsatisfactory. Most search errors come from semantic failures, and people cannot fix these errors easily. The reference interaction is very similar to the online searching interaction, and when reference interviews are included in the information retrieval process the online searches succeed. He advocated integrating more human, interactive communication models into online search systems.

Gene Golovchinsky (FX Palo Alto Laboratory, Inc.), "From Reading to Retrieval: Freeform Ink Annotations as Queries" - Golovchinsky wants to be able to annotate documents and draw queries out of these annotations. He wanted to establish a relevance value based on the annotations, and he believes this is workable because annotations are easy to make on paper (or on digital readers) and judgments do not have to be explicitly articulated. All of this holds because relevance was weighted on semantics and not necessarily statistics.

Steve Whittaker (AT&T), "SCAN" - There are large archives of speech data, but there are few techniques for accessing this information, and the serial nature of speech makes retrieval much different from text retrieval. To address this problem he and his co-workers at AT&T created a user interface to the speech output. Their example was voice-mail archives where high-volume users took notes on the messages, and his study tried to duplicate this note-taking process. SCAN (spoken content audio navigation, which includes a search) provides overviews and transcriptions of the messages, and provides the means to search over the content of voice-mail messages even after the messages have been saved as text.

Amit Singhal (AT&T), "Document Expansion for Speech Retrieval" - Singhal described what was "under the hood" of the SCAN system outlined above. In short, document expansion can greatly improve speech retrieval.

Yiming Yang (Carnegie Mellon University), "A Re-Examination Of Text Categorization Methods" - Yang wanted to re-examine methods for automatic text classification. She enumerated a number of previously published text categorization (TC) methods and compared and contrasted them accordingly.

Tuesday, August 17

[photographs: busman's holiday, Eric and Roy, SunSITE Goddesses]

Thomas Hofmann (University of California - Berkeley), "Probabilistic Latent Semantic Indexing" - He outlined latent semantic indexing (LSI) in terms of its strengths and weaknesses. To overcome some of its disadvantages he advocates bringing together documents in the collection to help locate new documents or determine the value of found documents, or to "decompose" the value of a document to find other documents. He now needs to create a probabilistic language model to place in the decomposition formula.

Chris Ding (Lawrence Berkeley National Laboratory), "A Similarity-based Probability Model for Latent Semantic Indexing" - Ding gave an overview of LSI and why it seems to work: LSI seems to capture the semantic meaning of documents by removing noise. Through a whole lot of mathematics he described a probability model for LSI.
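
For context, here is a minimal sketch of plain LSI (not Hofmann's probabilistic model nor Ding's similarity-based model): a term-document matrix is decomposed with a truncated SVD, and documents are then compared in the reduced "concept" space. The toy matrix is mine, purely for illustration.

  # A minimal sketch of plain LSI: decompose a toy term-document matrix
  # with a truncated SVD and compare documents in the reduced space.
  import numpy as np

  # Rows = terms, columns = documents (invented counts)
  A = np.array([[2, 0, 1, 0],
                [1, 1, 0, 0],
                [0, 2, 0, 1],
                [0, 0, 1, 2]], dtype=float)

  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2                                    # number of latent "concepts"
  docs = (np.diag(s[:k]) @ Vt[:k, :]).T    # documents in concept space

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  # Similarity of documents 0 and 2 after the "noise" has been removed
  print(cosine(docs[0], docs[2]))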

David Losada (University of Corunna), "Using a Belief Revision Operator for Document Ranking in Extended Boolean Models" - He described Dalal's belief revision operator, used, I think, to rank documents in extended Boolean models.

Jade Goldstein (Carnegie Mellon University & Just Research), "Summarizing Text Documents" - Summaries can manifest themselves as lists of keywords, headlines, and/or text spans. Most of her discussion focused on the scoring and ranking of text spans. There are problems with summarization, including intrinsic vs. extrinsic evaluation, differences in corpora, and differences in compression levels (i.e., the length of the summary compared to the length of the original document). She went on to describe how she tested some possibilities for doing the summarizations. Much of her technique was to add titles and first sentences to queries and thus create summaries.

Hongyan Jing (Columbia University), "The Decomposition of Human-Written Summary Sentences" - She described a process to discover whether or not a summary was created by cutting and pasting from original documents. This can be used to create summaries as well as to learn summary-writing rules.

Daniel Marcu (USC ISI), "The Automatic Construction of Large-scale Corpora for Summarization Research" - He proposed building extracts by removing phrases from texts as long as the similarity between the abstract and the remaining text stays the same. He then described an experiment used to validate this hypothesis, and he believes his algorithm works as advertised. He thought it was important to compare and contrast the length of extracts with abstracts; extracts are usually longer than abstracts.

Marcia Bates (UCLA), "Applying User Research Directly to Information System Design" - She explored what would happen if scholars of the Getty Museum had free rein of online searching services, and she outlined a model for search systems as an outcome of this exploration. She concluded, as per other research, that scientists used specific words and phrases whereas the humanists needed to combine more generic, broad phrases into Boolean queries. This means the indexing and thesauri must be designed accordingly. Humanities people don't do Boolean.

Annelise Mark Pejtersen (Risø) - She described how people come into a library asking for books using search terms or phrases that are not necessarily indexed. Examples included: no divorce, a happy ending, easy to read, etc. She proposed a framework to address these problems and advocated creating new classifications accordingly. She analyzed the way people searched and advocated the creation of a search system called Bookhouse. A problem for her system was the difference in how men, women, and children each think about their searches.

Raya Fidel (University of Washington) - Raya described the Web searching behavior of engineers. Her population was educated and experienced computer/Web users. She noticed they spent a lot of time selecting an information resource and identifying their need, and the moment they had identified the need they did the search; in short, they did a lot of planning. Additionally, they spent a lot of time comparing their search results with their need. She advocated the following features of a system: allow time for planning, help assess relevance, be easy to understand, allow users to apply their own strategies and rules, and find known-item sites.

Efthimis Efthimiadis (University of Washington) - He compared the searching behavior of Library and Information Science (LIS) and non-LIS people. Both groups spent a lot of time selecting search terms and identifying information needs. The LIS people knew about controlled vocabularies and the non-LIS people didn't, and the LIS people seemed to use more sophisticated search strategies. A system that would help these searchers would: identify databases, identify needs, provide broad-to-narrow techniques, give access to controlled vocabularies, support query expansion, offer multiple ranking assessments, and support "empirical" search strategies.

Wednesday, August 18

[photographs: bear, fountain, Screaming Man, golden bear, Indian sculpture]

Rila Mandala (Tokyo Institute of Technology), "Combining Multiple Evidence From Different Types Of Thesaurus For Query Expansion" - He gave an overview of query expansion techniques using hand-crafted thesauri or automatically constructed thesauri (co-occurrence-based thesauri and head-modifier-based thesauri). He then elaborated on the mathematical models underlying each of the techniques. He advocated combining these techniques to improve query expansion, and he backed up his claims with data.
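
A toy sketch of the general idea as I understood it (not Mandala's actual formulae): candidate expansion terms are scored by summing evidence from several thesauri, so terms suggested by more than one resource float to the top. The term lists and weights below are invented.

  # A toy sketch of combining expansion evidence from several thesauri:
  # terms supported by more than one resource accumulate a higher score.
  from collections import defaultdict

  candidates = {
      "hand-crafted":  {"automobile": 0.9, "vehicle": 0.8},
      "co-occurrence": {"automobile": 0.6, "engine": 0.7},
      "head-modifier": {"vehicle": 0.5, "engine": 0.4},
  }

  combined = defaultdict(float)
  for thesaurus, terms in candidates.items():
      for term, weight in terms.items():
          combined[term] += weight      # simple additive combination

  query = ["car"]
  top_terms = sorted(combined, key=combined.get, reverse=True)[:2]
  print(query + top_terms)              # the expanded query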

Jonghoon Lee (University of Illinois - Urbana-Champaign), "Context-Sensitive Vocabulary Mapping With A Spreading Activation Network" - Using ADS, an astrophysics bibliographic database, he proposed a method for vocabulary mapping and conducted two experiments to verify it. One problem with ADS is its heterogeneous controlled vocabularies (thesaurus terms); the problem is how to reconcile the differences between the thesauri. Possible solutions include lexical resemblance, syndetic structure, and co-occurrence data, and this last one is the solution he used for his study. The first experiment was a term-to-term application and the second was a context-sensitive mapping where they wanted to predict vocabulary terms from one thesaurus to the other. Using his method he believes he can predict about 90% of the vocabulary terms between controlled vocabularies. He proposed using this model to map vocabulary terms into full-text databases.

Mark Sanderson (University of Sheffield), "Deriving Concept Hierarchies From Text" - He was aiming to build a hierarchy like Yahoo's automatically. Document clustering methods are one approach, such as polythetic or monothetic clustering. Using monothetic clustering he created term pairs from sets of documents. It is important to note that the hierarchy is not a strict parent-child tree but a "DAG" (directed acyclic graph) where terms may have multiple parents. Using TREC data he tried to implement his model/idea, and for the most part it seemed to work; he then did a bit of user testing to validate his outcomes. To create this hierarchy he used document frequencies rather than term frequencies.
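
Here is a minimal sketch of a subsumption-style rule along the lines he described, where one term becomes a parent of another when most documents containing the second also contain the first; the documents and the 0.8 threshold are my own assumptions, not his exact parameters.

  # A minimal sketch of a subsumption-style rule for building the DAG from
  # document frequencies: x becomes a parent of y when most documents that
  # contain y also contain x. Documents and the threshold are invented.
  docs = [{"music", "jazz"},
          {"music", "jazz", "saxophone"},
          {"music", "opera"},
          {"music"}]

  def df(term, also=None):
      return sum(1 for d in docs if term in d and (also is None or also in d))

  terms = {t for d in docs for t in d}
  edges = []
  for x in terms:
      for y in terms:
          if x != y and df(x, also=y) / df(y) >= 0.8 and df(x) > df(y):
              edges.append((x, y))      # x subsumes (is a parent of) y

  print(edges)   # e.g. music -> jazz, music -> opera, jazz -> saxophone

Note that "saxophone" ends up with two parents ("music" and "jazz"), which is exactly the multiple-parent DAG behavior he mentioned.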

Mario Mantzourogiannis (Hong Kong University of Science and Technology), "Content-Based Retrieval Using Heuristic Search" - (I'm sorry, but I have no idea what the presentation was about; while the overheads were simple, the presentation was totally over my head.)

Yuen-Hsien Tseng (Fu Jen Catholic University), "Content-Based Retrieval For Music Collections" - Tseng alluded to the specific problems associated with music retrieval, since it involves different features: rhythm, melody, fragmentation, pitch error, key match, etc. To overcome some of these problems they extract the key melodies from the music, because these melodies act like keywords and appear as repeated patterns. He then articulated an algorithm used to extract the melody. For his experiments, Tseng retrieved MIDI files of pop and classical music from Internet sources; they converted the MIDI files into text strings and looked for patterns, and the strings can be converted back into MIDI. Queries in the system are input as text strings, and files can be uploaded into the system as well. Their experiment was to see whether queries input as strings, files, singing, or notation could be retrieved successfully, and for the most part they felt their experiments were successful.
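
A toy sketch of the melody-as-string idea (not Tseng's actual algorithm): encode each melody as a string of pitch intervals, which makes matching independent of key, and then look for the query string inside each encoded tune. The tunes and the encoding are invented for illustration.

  # A toy sketch of melody-as-string retrieval: key-invariant matching by
  # encoding note sequences as comma-delimited strings of pitch intervals.
  def intervals(notes):
      # surrounding commas keep multi-digit intervals from matching partially
      return "," + ",".join(str(b - a) for a, b in zip(notes, notes[1:])) + ","

  collection = {
      "tune A": [60, 62, 64, 65, 67],          # C D E F G
      "tune B": [60, 60, 67, 67, 69, 69],      # C C G G A A
  }

  query = [62, 64, 66, 67]                     # same contour as tune A, new key
  hits = [name for name, notes in collection.items()
          if intervals(query) in intervals(notes)]
  print(hits)                                  # -> ['tune A']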

Eamonn Keogh (University Of California - Irvine), "Relevance Feedback Retrieval Of Time Series Data" - He started out describing time series data, and a lot of it, as graphs of data; there are patterns in these time series data sets. The usual problem is that finding them takes a lot of time to compute, and he explained a few ways of speeding it up. Using relevance feedback, Keogh assigned weights to parts of the graphs to locate answers to queries; this feedback is displayed as a graph where the user can select values associated with the graphs. Using precision and recall measures, Keogh was able to quickly find similar patterns to create new queries. Unfortunately, global distortions in the data make relevance feedback unreliable. This is somewhat overcome with the implementation of user profiles, which articulate how much global distortion users are willing to accept. Combining these methods, Keogh believes he can effectively locate intervals in time series data.
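
A minimal sketch of the flavor of this idea (not Keogh's actual method): a weight vector, raised over the intervals the user marks as important, emphasizes those regions in a weighted Euclidean distance used to rank candidate series. All data and weights below are made up.

  # A minimal sketch of relevance feedback over time series: user feedback
  # raises weights on an interval, and a weighted Euclidean distance then
  # ranks candidate series by similarity to the query.
  import numpy as np

  def weighted_distance(q, s, w):
      return np.sqrt(np.sum(w * (q - s) ** 2))

  rng = np.random.default_rng(0)
  query = np.sin(np.linspace(0, 6, 100))
  candidates = [query + rng.normal(0, noise, 100) for noise in (0.1, 0.5, 1.0)]

  weights = np.ones(100)
  weights[40:60] *= 3.0          # feedback: the middle interval matters most
  weights /= weights.sum()

  ranking = np.argsort([weighted_distance(query, c, weights) for c in candidates])
  print(ranking)                 # candidates ordered by weighted similarity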

Thursday, August 19

[photographs: giant giraffes, San Francisco, giant anchor, Sigma Nu]

Caroline Eastman (University Of South Carolina), "Customizable Information Components" - Eastman began with an overview of adaptive hypermedia; her model is more about adaptive presentation, through what she called customizable information components (CIC). CICs include base information components, derived information components, and auxiliary information components; they are different parts of the original text. Derivations of CICs can be automatic/manual, reversible/non-reversible, or preserving/non-preserving. System implementations included X Windows and a HyperCard implementation with user-controlled adaptation, where a limited number of transformations were supported; the HyperCard implementation was a department reference handbook. She also alluded to limited experiments, and it seemed she had difficulty quantifying her experimental results. One project she is working on is an adaptive interface to a database about deer mice.

John Tait (University of Sunderland), "Syntactic Simplification Of Newspaper Text For Aphasic Readers" - Tait wanted to adapt the local newspaper for people who suffer from aphasia, people who have difficulty with their vocabulary after brain damage and are simply unable to read well. His goal was to allow the users to retain comprehension during the reading process. His system is called PSET, and he then outlined how the system is/has been built; the system is essentially an implementation of a pattern-matching algorithm.

Cecile Paris (CSIRO), "Adaptive Presentations" - Paris looked at the problem of adaptive hypermedia as one of communication, specifically communication between computer and user, and she wanted to combine natural language processing with user modeling. She advocates combining these things to create dynamically generated hypertext and to capture what the user has done before, for the purposes of predicting and helping the user in further dialog. She briefly described a pair of implemented systems using the principles outlined.

Eric Lease Morgan (NC State University), "MyLibrary@NCState: The Implementation of a User-centered, Customizable Interface to a Library's Collection of Information Resources" - After three days of SIGIR, I re-organized my presentation so it fit the presentation format I had seen repeated throughout the conference. I then described MyLibrary@NCState and asked the audience for suggestions on how to evaluate the system's effectiveness. Suggestions included: 1) measure how often people come back to the service, 2) measure how often people change their profiles, 3) measure how well people perform specific tasks as well as compare and contrast how people use the customizable interface and the non-customizable interface, 4) make sure to get some base-line data, 5) capture people's searches, 6) measure how many people go back to the old service, 7) do some sort of controlled experiment, 8) allow people to annotate items in their personal collections, 9) after people do a task ask them to evaluate how well the task was implemented, 10) model how people find information with and without the system, 11) try implementing the service implicitly or explicitly, 12) measure users' expectations, 13) keep track of where users go, and 14) allow users to customize the layout and measure what is closer to the top of their pages. For a more complete description of my presentation, see http://infomotions.com/musings/sigir-99/ .

Riccardo Mazza, "Swisscast" - Mazza wanted to study communication as it happens on the Internet. He was part of an interdisciplinary research team including a communications theorist, a visual communication specialist (GUI), an information technology expert (programmer), and a systems administrator. He articulated the process via push as well as pull technologies. Swisscast gathers pharmaceutical information, classifies the information, and finally delivers the information according to a profile. Much of his work is available at http://swisscast.ti-edu.ch/ .

Nick Belkin (Rutgers University), "Issues in User Modeling in IR" - The presentation was mostly a collection of homilies articulating some of the problems we face in user modeling and IR. User modeling is partly a problem of resolving the user's problem, and partly a problem of how to represent the data the person would like to retrieve. Presently IR has addressed the information needs, knowledge of a topic, knowledge of a system, and goals of the user, and we do this in order to predict, learn, guide, or lead people. One of the first examples of SDI services was offered by NLM through MeSH and MEDLINE. Problems to explore include the UI, methods of knowledge acquisition, and being sure the information is valid.

Achim Leubner (University Of Augsburg), "Personalized Nonlinear Ranking Using Full-Text Preferences" - In his system preferences are "partial orders", and some of his preferences allow for prioritization; he combines preferences to allow for this prioritization. These preferences are used to create a query applied against a database, and the results are returned in a ranked order reflecting the preferences.

Henry Naftulin (University Of California - Davis), "Organizing Information for the Internet and E-Commerce" - He was interested in ways of presenting information from search results; his particular domain was electronic commerce. He presented an overview of the current searching process and alluded to its various problems and difficulties: in short, catalogs are not organized, and there is no way to manipulate found objects. A proposed solution is to use data clustering and present results in a hierarchical manner.

David Harper (Robert Gordon University), "Effective Information Access" - He outlined an information delivery chain: information production, database production, database querying, information seeking, and information consumption. Each of these points provides opportunities for customization, which can be articulated as implicit or explicit customization. There are three types of services in his application (Sketchtrieve): search, filter, and customization of appearance, functionality, or hidden controls. The implementation creates a "2-dimensional canvas" for conducting the entire search process, and these services can be cloned and then modified. Another customization technique was the use of a clustering implementation (WebCluster) after browsing; it seemed a lot like Bates' "pearl growing". He advocated group customization as opposed to individual customization, because individual customizations may be harder to communicate to others.

Carolyn Watters (Dalhousie University), "Adaptive Hypertext And Medical Portals: Tenacious Hypertext" - She described a medical portal application, a view of the Web. In her implementation she wanted personalization, device independence, and group support. It is adaptable in terms of form, content, and function, but not necessarily performance. She displayed the interface to the portal and it included many of the usual suspects of medical information. She calls the system tenacious because it provides continuity between sessions, devices, and groups over periods of time; it maintains state.

Richard Bodner (University of Toronto), "Guided Personalization in Information Browsing" - His goal is to outline an approach for personalization based on the interaction between a reader/user model and authoring expertise within a dynamic hypertext framework. He wants to blend hypertext links with queries; he calls this dynamic hypertext because the links are created on the fly. Documents are associated with 'nudging rules' (qualities) allowing for the creation of these dynamically created documents/links.

Ross Wilkinson (CSIRO), "Evaluation" - Recall and precision evaluation is difficult to do with adaptive technology. Instead we might explore the possibilities of psychological evaluation techniques: we can evaluate system complexity, user satisfaction, communication cost, and task effectiveness. IR is difficult to measure in terms of relevance and precision because there is a lot of noise in IR; thus IR experiments are not statistically significant.

Summary

[photographs: Eric, Diana, and Dave; frisbee dudes; shortest hole]

I enjoyed this conference. It stretched me, and that is a good thing.

There was a pattern to these presentations. Each seemed to consist of the following parts:

  1. Overview of presentation
  2. Problem statement
  3. Proposed solution or model for solving the problem (theory)
  4. Creation and execution of an experiment (observation)
  5. Results
  6. Discussion
  7. Future work
  8. Conclusion

The problem statements addressed the same issues I, as a librarian, would normally address, but the solutions and experiments were simply not the methods I use to solve problems. After a bit of reflection, I believe I should try this more scientific approach to problem solving. Along a similar line, many of the issues addressed at this conference were the same issues I address at my work as a librarian, but the terms they used to reference these issues were different. For example, clustering could be a synonym for cataloging and classification. Summarizing is just like abstracting. Query expansion is sort of like an automated form of the reference interview or "pearl growing". I also got the distinct impression that all of these people assumed there was some sort of underlying pattern or organization to the things they were studying. If not, then all of their experiments would be for nought. Are things as consistent as people make them out to be? Even if I can devise a mathematical formula or 'natural law' predicting behavior or actions, that law, in these cases, will most likely only be able to predict probability and not necessarily certainty.

Like I said above, this conference stretched me, and like I said in other venues, the sort of conferences I've attended this year that were outside my particular professional discipline have enriched me. (See Languaging '99 and CAP '99 .) Each culture I have visited has had its own way of understanding its discipline. If I were to make a gross generalization, I would say the humanists put more stock and emphasis on values and deduction to understand their reality while the scientists based the truth of their studies on demonstrated experiment (inference) and mathematics. In my opinion, neither method can stand alone. Rather, both must be used to understand one's world. The strengths of one method make up for the weaknesses of the other. Arscience.

On a more personal note, this was my first visit to Berkeley. I enjoyed soaking up the expressive nature of the incoming college students. I remember the beating of the drums, and I also remember the necessity of telling somebody that I hate them before they would believe what I was saying. I saw the computer where the Alex Catalogue resides, and I had a professional visit with Roy Tennant. I read and wrote in an opulent reading room of the Doe library, and "I left my mark". Visiting David Cherry (illustrator for Tricks ) and his wife was very nice. It is always good to keep in touch with your past. I experienced an earthquake while riding the subway, but nothing happened. Finally, I had just about my worst round of golf on the course with the shortest hole!


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This text was never formally published
Date created: 1999-08-23
Date updated: 2004-11-25
Subject(s): travel log; Berkeley, CA; information retrieval; adaptive hypermedia;
URL: http://infomotions.com/musings/sigir-99-notes/