[Xapian-discuss] Xapian and research in IR: a few suggestions from experience

Mon Sep 3 10:27:27 BST 2007

Greetings,

In its present state, Xapian offers Databases and Documents, from which 
TermIterators and PostingIterators give access to individual terms and 
documents, as well as to information such as term frequency (called 
"within-document frequency", wdf) and IDF (Inverse Document Frequency) terms.

Recent research in information retrieval has focused on retrieval models based 
entirely or partially on "latent" features of documents and document 
collections (for instance sets of probabilities computed from the TF and IDF 
terms). This comes down to associating "additional data" with documents, terms 
and document collections; the nature of the information, and what it is 
associated with, varies according to the model.
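
As a toy illustration of such "additional data", here is a minimal sketch in 
plain Python that does not use Xapian at all. The collection, the names 
(docs, wdf, termfreq, term_distribution) and the choice of normalised tf-idf 
weights are all assumptions made for the example, not part of any existing 
API or of any particular published model.

```python
import math
from collections import Counter

# Toy collection (hypothetical data): document id -> list of terms.
docs = {
    1: ["latent", "model", "retrieval"],
    2: ["retrieval", "index", "retrieval"],
    3: ["latent", "topic", "model", "model"],
}

# Within-document frequency: occurrences of each term in each document.
wdf = {docid: Counter(terms) for docid, terms in docs.items()}

# Number of documents in which each term occurs.
termfreq = Counter(term for counts in wdf.values() for term in counts)


def idf(term):
    """Plain log(N / n_t) inverse document frequency."""
    return math.log(len(docs) / termfreq[term])


def term_distribution(docid):
    """Per-document 'additional data': tf-idf weights normalised to sum to 1.

    Assumes at least one term of the document is absent from some other
    document, so that the total weight is non-zero.
    """
    weights = {t: n * idf(t) for t, n in wdf[docid].items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```

Here the distribution is attached to a document, while termfreq is attached to 
the collection as a whole; a different model could attach its data elsewhere.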

An extension giving Xapian generic support for such retrieval models would 
make it a huge asset for the information retrieval research community (as 
well as for document classification and related topics). I propose creating 
a layer offering generic methods for accessing, modifying, reading and 
loading such data. As far as I understand, this would imply:

1) Set generic "slots" on all objects that may be associated with latent 
data in a latent retrieval model; this would include the major components 
of Xapian: Databases, Documents and TermIterators 
(the "TermIterator::get_latent_data()" method could ideally behave like 
TermIterator::get_wdf(), by "knowing" whether it has been instantiated from 
a Document or a Database and returning different results depending on this 
context).
2) Offer generic read/write methods. Coherence of the data should be 
maintained, for instance by offering virtual 
methods "documentAdded()", "documentRemoved()", ... to allow users to apply 
the necessary operations to their data as needed.
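
The two points above could be sketched roughly as follows. This is plain 
Python rather than Xapian code, and every name in it (LatentStore, 
LengthModel, document_added, document_removed, the slot kinds) is 
hypothetical; it only illustrates the shape of the proposed layer, with a 
base class holding generic slots and a model overriding the hooks to keep 
its own data coherent.

```python
class LatentStore:
    """Hypothetical generic slots for latent data, keyed by object kind and id."""

    def __init__(self):
        self._slots = {"database": {}, "document": {}, "term": {}}

    def set(self, kind, key, value):
        self._slots[kind][key] = value

    def get(self, kind, key, default=None):
        return self._slots[kind].get(key, default)

    # Hooks that a concrete model overrides to keep its data coherent.
    def document_added(self, docid, terms):
        pass

    def document_removed(self, docid):
        self._slots["document"].pop(docid, None)


class LengthModel(LatentStore):
    """Example model: per-document length plus a collection-wide total."""

    def document_added(self, docid, terms):
        total = self.get("database", "total_length", 0)
        self.set("document", docid, len(terms))
        self.set("database", "total_length", total + len(terms))

    def document_removed(self, docid):
        total = self.get("database", "total_length", 0)
        length = self.get("document", docid, 0)
        self.set("database", "total_length", total - length)
        super().document_removed(docid)


# The indexing code would call the hooks whenever the collection changes.
model = LengthModel()
model.document_added(1, ["latent", "model"])
model.document_added(2, ["retrieval", "index", "retrieval"])
model.document_removed(1)
```

The context-dependent behaviour mentioned in point 1 could be approximated 
in this sketch by keying the "term" slot either by a term alone 
(collection-level data) or by a (docid, term) pair (document-level data).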

As a less important and less fundamental suggestion, I'd like to mention that 
in research it is often important to have unique, fixed identifiers 
(strings) for documents. I have seen this done by using prefixed terms (which 
is not very clean) or by using the "data" field of documents (which lacks an 
iterator: one cannot easily jump to one particular document this way). It 
might be worth doing something at this level (maybe simply by wrapping 
the "prefixed term" approach into something cleaner).
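
A minimal sketch of what such a wrapper might look like, again in plain 
Python with a dictionary standing in for the database. The class IdIndex, 
its methods and the "Q" prefix are assumptions made for the example; the 
point is only that the prefix handling is hidden behind add() and find().

```python
class IdIndex:
    """Toy stand-in for a database: maps a prefixed unique term to a document."""

    PREFIX = "Q"  # assumption: a prefix reserved for unique identifiers

    def __init__(self):
        self._by_term = {}
        self._next_docid = 1

    def add(self, unique_id, data):
        """Store a document under a unique string identifier."""
        term = self.PREFIX + unique_id
        if term in self._by_term:
            raise KeyError(f"duplicate identifier: {unique_id}")
        docid = self._next_docid
        self._next_docid += 1
        self._by_term[term] = (docid, data)
        return docid

    def find(self, unique_id):
        """Jump directly to the document carrying this identifier."""
        return self._by_term[self.PREFIX + unique_id]


index = IdIndex()
index.add("doc-a", "first document")
index.add("doc-b", "second document")
docid, data = index.find("doc-b")
```

Callers only ever see the identifier string; the prefixed term stays an 
implementation detail.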

Thank you very much for your interest, and most of all for Xapian itself.

-- 
Emmanuel Eckard                              
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse                     
+41 21 693 66 97       

()  ascii ribbon campaign - against html mail 
/\                        - against microsoft attachments