Los Cabos  

October 11-13 2010

Los Cabos, México

 
line decor
  
line decor
 
 
 
 

 
 
String Processing and Information Retrieval Symposium

SPIRE 2010 will feature two invited talks

Mark Najork (Microsoft Research)

Mining the web graph

Web pages and the hyperlinks that connect them can be viewed as a graph. This graph can be mined for many purposes: ranking web search results, identifying online communities, detecting spam web pages, and more. However, the sheer size of the web graph and its ever-evolving nature make such computations very challenging. In this talk, I will describe various link-based ranking experiments conducted over a 17 billion edge graph, and the infrastructure that enabled us to perform these experiments. I will discuss the challenges we encountered (both engineering and theoretical), and offer some speculations as to where web graph mining is headed.

Gonzalo Navarro (U. Chile, Chile)

Compact Data Structures for IR: The Time Has Come

One decade ago we witnessed the birth of self-indexing, a technology that caused a revolution in the way text databases can be stored and indexed. The problem of storing huge data structures for searching sequences virtually disappeared with this technique, that manages to store both text and index within the space of the compressed text. Those concepts are today making their way through application areas like Computational Biology and handling Oriental languages, among others. Self-indexes build on the more basic concept of compact data structures, which mimic their classical counterparts within much less space and sometimes, surprisingly, much wider functionality.

Curiously, Information Retrieval has stayed comfortably out of the way. There are several good reasons for that: (1) Its retrieval model is oriented to natural language words rather than characters in general sequences. (2) Its favorite data structure, the inverted lists, is already well compressed (yet it works for word-oriented natural language only). (3) It is less interested in retrieving the occurrences of a pattern, which is what self-indexes do, than in more sophisticated retrieval problems.

A few years ago, several researchers have started approaching the real IR problems from a variety of perspectives, with the aim of using compact data structures to improve the algorithmic base on which more sophisticated operations build. Their efforts have yielded several interesting results, which I will try to summarize in this talk. These can be divided into the following areas, which I have ordered, I suppose, from most to least radical for the eye of someone used to the inverted list solution.

(1) IR for general texts, and back.

A basic task in IR is "document retrieval", that is, find the documents where a term appears (as opposed to finding all its individual occurrences). This can be enriched with the requirement of finding the top-k documents for a query, that is, find the k "most relevant" documents according to some relevance measure. Conjunctive queries involving several terms are also quite popular.

This type of retrieval is very natural and there is no reason to think it would not be of interest for general texts like DNA or protein sequences, let alone databases in Oriental languages. Inverted lists are designed to solve this problem, but they do not apply well to general texts.

It is possible to adapt self-indexes, which work on general sequences, to solve these problems. Curiously, giving less information (just the documents where the occurrences appear) seems to require more space than giving all their exact occurrences. I will cover the work that has been done in this topic, which has now reached a fair amount of achievements. Curiously again, the latest result are able to beat inverted lists when applied back on natural language texts.

The main open challenge in these solutions is to reduce the space redundancy that is invariably introduced in the various solutions.

(2) Self-indexes for natural language.

Still in some cases IR systems require to find all the occurrences of a given word or phrase. Self-indexes can be applied over natural language texts, quite successfully if these are seen as sequences of words and separators, not characters. Within the space of the compressed text, they fit the compressed text and the inverted list functionality, achieving half the space of the classical IR solution. Moreover, they handle phrase queries naturally and and efficiently, without the intersections that are the headache of inverted list solutions.

The main open challenges in these solutions are that they are somewhat slow to locate each individual occurrence (so they lose to inverted lists when locating many occurrences), and that they work on main memory (locality of reference is lost).

(3) Extended-functionality Compressed Inverted Lists

Finally, there have been some results on using compact data structures for representing inverted lists in compressed form while providing much wider functionality than supporting extraction and intersection of lists.

An inverted list can be seen as a binary relation between terms and documents. By using recent rich-functional compact representations for binary relations one can compete in space with classical inverted list compression, while providing new functionalities such as on-the-fly stemming or other static word groupings, restricting the search to ranges or subdirectories of documents, or to subtrees in XML documents, etc.

In the invited paper associated to this talk (with Simon Puglisi) we show also how wavelet trees can be used to achieve dual-ordered inverted lists, that is, an inverted lists that is simultaneously sorted by document id (useful for intersections and document retrieval) and by term frequency (useful for ranked retrieval).

This is probably the area where most practical development and efficiency comparisons with classical solutions are still needed, but certainly a very promising one, as it is the one requiring least effort to plug into existing systems relying on inverted lists.

 

 
 

 

Topics
The scope of the SPIRE series of conferences includes not only fundamental algorithms in SP and IR but also contributions in different application areas such as computational biology, DNA sequencing, WWW based IR systems, and IR related languages like SGML and XML. Given its inter-disciplinary nature, SPIRE offers a singular opportunity for researchers from computer science, information science, and engineering interested in working with problems related to these areas to meet and discuss how to collaborate towards the best solutions. Typical topics of interest include, but are not limited to

  • String Processing: Dictionary algorithms, Text searching, Pattern matching, Text and sequence compression, Automata based string processing.
  • Information Retrieval: Information retrieval models, Indexing, Ranking and filtering, Interface design, Visualization, Benchmarking.
  • Natural language processing:Text analysis, Text mining, Machine learning, Information extraction, Language models (both structural and semantic), Knowledge representation.
  • Search applications and usage: Cross-lingual information access systems, Multimedia information access, Digital libraries, Collaborative retrieval and Web related applications, Semi-structured data retrieval, Evaluation.
  • Interaction of biology and computation: DNA sequencing and applications in molecular biology, Evolution and phylogenetics, Recognition of genes and regulatory elements, Sequence driven protein structure prediction.
  • Efficient implementation of IR systems: Practical implementations with strong experimental support, toolkits for IR systems. Algorithms and data structures for IR.