[Xapian-discuss] Xapian and research in IR: a few suggestions from
experience
Emmanuel Eckard
emmanuel.eckard at epfl.ch
Mon Sep 3 10:27:27 BST 2007
Greetings,
In its present state, Xapian offers Databases and Documents, from which
TermIterators and PostingIterators give access to individual terms and
documents, as well as to statistics such as term frequency (which Xapian calls
"within-document frequency", wdf) and the document frequencies needed for IDF
(inverse document frequency) weights.
Recent research in information retrieval has focused on retrieval models based
entirely or partially on "latent" features of documents and document
collections (for instance, sets of probabilities computed from the TF and IDF
terms). This comes down to associating "additional data" with documents, terms
and document collections -- the nature of the information, and what it is
attached to, vary according to the model.
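To make "additional data computed from the TF and IDF terms" concrete, here
is a minimal sketch using only the C++ standard library. It is not tied to
any particular latent model: TermCounts, term_distribution() and idf() are
hypothetical names invented for this illustration, and the maximum-likelihood
estimate wdf / document length is just one simple choice of derived quantity.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical stand-in for per-document index statistics:
// term -> within-document frequency (wdf).
using TermCounts = std::map<std::string, unsigned>;

// One possible piece of "latent" data attached to a document:
// P(term | document), estimated as wdf / document length.
std::map<std::string, double> term_distribution(const TermCounts& wdf) {
    unsigned long length = 0;
    for (const auto& entry : wdf) length += entry.second;

    std::map<std::string, double> p;
    if (length == 0) return p;  // empty document: nothing to estimate
    for (const auto& entry : wdf)
        p[entry.first] = static_cast<double>(entry.second) / length;
    return p;
}

// One possible piece of "latent" data attached to a term:
// a plain (unsmoothed) inverse document frequency, log(N / df).
double idf(unsigned n_docs, unsigned doc_freq) {
    assert(doc_freq > 0 && doc_freq <= n_docs);
    return std::log(static_cast<double>(n_docs) / doc_freq);
}
```

The point of the sketch is only that such values belong to a document or a
term, must be stored somewhere, and must be recomputed when the collection
changes.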
An extension allowing Xapian to support such retrieval models generically
would make Xapian a huge asset for the information retrieval research
community (as well as for document classification and related topics). I
propose creating a layer offering generic methods for accessing, modifying,
reading and loading such data. As far as I understand, this would involve:
1) Setting generic "slots" on all objects that can be associated with
latent data in a latent retrieval model; this would include the major
components of Xapian: Databases, Documents and TermIterators
(the "TermIterator::get_latent_data()" method could ideally behave like
TermIterator::get_wdf(), by "knowing" whether it has been
instantiated from a Document or a Database and returning different results
depending on this context).
2) Offering generic read/write methods. Consistency of the data should be
maintained, for instance by offering virtual
methods "documentAdded()", "documentRemoved()", ... that let the user apply
the necessary operations to their data as needed.
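The two points above could be sketched roughly as follows. This is a toy
illustration built on the standard library only, not existing Xapian API:
LatentModel, LatentSlots, LengthModel and ToyIndex are hypothetical names,
and only the hook names documentAdded() and documentRemoved() come from the
proposal itself. The "latent data" here is deliberately trivial (the document
length) so that the control flow stays visible.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using DocId = unsigned;
using LatentData = std::vector<double>;  // opaque to the indexing layer

// Hypothetical hook interface: the index calls these so that user-defined
// latent data stays consistent with the collection.
class LatentModel {
public:
    virtual ~LatentModel() = default;
    virtual void documentAdded(DocId id,
                               const std::vector<std::string>& terms) = 0;
    virtual void documentRemoved(DocId id) = 0;
};

// Hypothetical generic "slot" store: one latent value per document.
class LatentSlots {
    std::map<DocId, LatentData> slots_;
public:
    void set(DocId id, LatentData data) { slots_[id] = std::move(data); }
    const LatentData* get(DocId id) const {
        auto it = slots_.find(id);
        return it == slots_.end() ? nullptr : &it->second;
    }
    void erase(DocId id) { slots_.erase(id); }
    std::size_t size() const { return slots_.size(); }
};

// Example model: keeps the document length as a one-element latent vector.
class LengthModel : public LatentModel {
    LatentSlots& slots_;
public:
    explicit LengthModel(LatentSlots& slots) : slots_(slots) {}
    void documentAdded(DocId id,
                       const std::vector<std::string>& terms) override {
        slots_.set(id, LatentData{static_cast<double>(terms.size())});
    }
    void documentRemoved(DocId id) override { slots_.erase(id); }
};

// Toy index standing in for a writable database: it only assigns ids and
// notifies the model; it stores no postings.
class ToyIndex {
    LatentModel& model_;
    DocId next_id_ = 1;
public:
    explicit ToyIndex(LatentModel& model) : model_(model) {}
    DocId add_document(const std::vector<std::string>& terms) {
        DocId id = next_id_++;
        model_.documentAdded(id, terms);
        return id;
    }
    void delete_document(DocId id) { model_.documentRemoved(id); }
};
```

Keeping the slot store separate from the model means the storage side can
stay generic while each retrieval model decides what goes into the slots and
when it must be recomputed.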
As a less important and less fundamental suggestion, I'd like to mention that
in research it is often important to have unique, well-defined identifiers
(strings) for documents. I have seen this done by using prefixed terms (which
is not very clean) or by using the "data" field of documents (which lacks an
iterator: one cannot easily jump to a particular document this way). It
might be interesting to do something at this level (maybe simply by wrapping
the "prefixed term" approach in something cleaner).
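A wrapper of that kind might look like the sketch below. It is a
standard-library toy, not Xapian code: ExternalIdIndex and its methods are
hypothetical names, and the "Q" prefix is assumed here because it is the
prefix conventionally used for unique-ID terms. In a real database the map
would be replaced by a lookup in the posting list of the prefixed term.

```cpp
#include <cassert>
#include <map>
#include <string>

using DocId = unsigned;

// Hypothetical wrapper hiding the "prefixed term" convention behind a
// string-identifier interface.
class ExternalIdIndex {
    std::map<std::string, DocId> by_term_;  // stands in for the posting lists
public:
    // Build the term that carries the identifier ("Q" prefix assumed).
    static std::string id_term(const std::string& external_id) {
        return "Q" + external_id;
    }

    // Associate an identifier with a document.
    // Returns false (and changes nothing) if the identifier is already used.
    bool bind(const std::string& external_id, DocId docid) {
        return by_term_.emplace(id_term(external_id), docid).second;
    }

    // Jump directly to the document carrying this identifier.
    // Returns false if there is none; 'docid' is then left untouched.
    bool find(const std::string& external_id, DocId& docid) const {
        auto it = by_term_.find(id_term(external_id));
        if (it == by_term_.end()) return false;
        docid = it->second;
        return true;
    }

    // Forget an identifier, e.g. when its document is deleted.
    bool unbind(const std::string& external_id) {
        return by_term_.erase(id_term(external_id)) > 0;
    }
};
```

Callers then deal only in identifier strings and never see the prefix, which
is the "cleaner" part; enforcing uniqueness in bind() is a design choice of
this sketch.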
Thank you very much for your interest, and most of all for Xapian itself.
--
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97
() ascii ribbon campaign - against html mail
/\ - against microsoft attachments