|
SPIRE 2010 will feature two invited talks
Mining the web graph
Web pages and the hyperlinks that connect them can be viewed as a graph. This graph can be mined for many purposes: ranking web search results, identifying online communities, detecting spam web pages, and more. However, the sheer size of the web graph and its ever-evolving nature make such computations very challenging. In this talk, I will describe various link-based ranking experiments conducted over a 17 billion edge graph, and the infrastructure that enabled us to perform these experiments. I will discuss the challenges we encountered (both engineering and theoretical), and offer some speculations as to where web graph mining is headed.
Compact Data Structures for IR: The Time Has Come
One decade ago we witnessed the birth of self-indexing, a technology
that caused a revolution in the way text databases can be stored and
indexed. The problem of storing huge data structures for searching sequences
virtually disappeared with this technique, that manages to store both text and index within the space of the compressed text. Those concepts are today making their way through application areas like Computational Biology and
handling Oriental languages, among others. Self-indexes build on the more
basic concept of compact data structures, which mimic their classical counterparts within much less space and sometimes, surprisingly, much wider
functionality.
Curiously, Information Retrieval has stayed comfortably out of the way.
There are several good reasons for that: (1) Its retrieval model is oriented to
natural language words rather than characters in general sequences. (2) Its favorite data structure, the inverted lists, is already well compressed (yet it
works for word-oriented natural language only). (3) It is less interested in retrieving the occurrences of a pattern, which is what self-indexes do, than in more sophisticated retrieval problems.
A few years ago, several researchers have started approaching the real IR
problems from a variety of perspectives, with the aim of using compact data
structures to improve the algorithmic base on which more sophisticated operations build. Their efforts have yielded several interesting results, which I will try to summarize in this talk. These can be divided into the following areas, which I have ordered, I suppose, from most to least radical for the eye of someone used to the inverted list solution.
(1) IR for general texts, and back.
A basic task in IR is "document retrieval", that is, find the documents where
a term appears (as opposed to finding all its individual occurrences). This can
be enriched with the requirement of finding the top-k documents for a query,
that is, find the k "most relevant" documents according to some relevance
measure. Conjunctive queries involving several terms are also quite popular.
This type of retrieval is very natural and there is no reason to think it would
not be of interest for general texts like DNA or protein sequences, let alone
databases in Oriental languages. Inverted lists are designed to solve this problem, but they
do not apply well to general texts.
It is possible to adapt self-indexes, which work on general sequences, to
solve these problems. Curiously, giving less information (just the documents where the occurrences appear) seems to require more space than giving all their
exact occurrences. I will cover the work that has been done in this topic, which has now reached a fair amount of achievements. Curiously again, the latest result are able to beat inverted lists when applied back on natural language texts.
The main open challenge in these solutions is to reduce the space redundancy
that is invariably introduced in the various solutions.
(2) Self-indexes for natural language.
Still in some cases IR systems require to find all the occurrences of a given
word or phrase. Self-indexes can be applied over natural language texts, quite
successfully if these are seen as sequences of words and separators, not
characters. Within the space of the compressed text, they fit the compressed
text and the inverted list functionality, achieving half the space of the
classical IR solution. Moreover, they handle phrase queries naturally and
and efficiently, without the intersections that are the headache of inverted list solutions.
The main open challenges in these solutions are that they are somewhat slow
to locate each individual occurrence (so they lose to inverted lists when
locating many occurrences), and that they work on main memory (locality of
reference is lost).
(3) Extended-functionality Compressed Inverted Lists
Finally, there have been some results on using compact data structures
for representing inverted lists in compressed form while providing much
wider functionality than supporting extraction and intersection of lists.
An inverted list can be seen as a binary relation between terms and documents.
By using recent rich-functional compact representations for binary relations
one can compete in space with classical inverted list compression, while
providing new functionalities such as on-the-fly stemming or other static
word groupings, restricting the search to ranges or subdirectories of documents, or to subtrees in XML documents, etc.
In the invited paper associated to this talk (with Simon Puglisi) we show also how wavelet trees can be used to achieve dual-ordered inverted lists, that
is, an inverted lists that is simultaneously sorted by document id (useful
for intersections and document retrieval) and by term frequency (useful for
ranked retrieval).
This is probably the area where most practical development and efficiency
comparisons with classical solutions are still needed, but certainly a very
promising one, as it is the one requiring least effort to plug into existing
systems relying on inverted lists.
|