[Xapian-discuss] Re: Xapian and research in IR: a few suggestions from experience

Emmanuel Eckard emmanuel.eckard at epfl.ch
Wed Sep 5 17:45:01 BST 2007


> Do you have some pointers to the models you have in mind so I can get
> an idea what sort of data we might be talking about?  (I can see this
> could be useful for storing a "reputation" score for each document,
> derived from link analysis, user clicks, etc.)

I am now working with PLSA (Probabilistic Latent Semantic Analysis), which 
assumes that documents (d) and terms (w) are associated with categories (z), 
and represents the data as mixture models of P(z), P(d|z) and P(w|z). I have 
implemented this model using objects built around Xapian. There is also some 
work done on naive Bayes models and Latent Dirichlet Allocation.
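
To make the mixture concrete, here is a minimal sketch, assuming the usual
symmetric PLSA formulation P(d,w) = sum_z P(z) P(d|z) P(w|z). It is not the
code built around Xapian mentioned above; the function name and the toy
probabilities are made up for illustration only.

```python
def plsa_joint(p_z, p_d_given_z, p_w_given_z, d, w):
    """Joint probability P(d, w) = sum over z of P(z) * P(d|z) * P(w|z)."""
    return sum(
        p_z[z] * p_d_given_z[z][d] * p_w_given_z[z][w]
        for z in range(len(p_z))
    )


# Toy parameters (assumed values): 2 categories, 2 documents, 3 terms.
# Each inner list is a probability distribution and sums to 1.
p_z = [0.5, 0.5]                       # P(z)
p_d_given_z = [[0.5, 0.5],             # P(d|z=0)
               [0.25, 0.75]]           # P(d|z=1)
p_w_given_z = [[0.5, 0.25, 0.25],      # P(w|z=0)
               [0.25, 0.25, 0.5]]      # P(w|z=1)

# Probability of seeing term 0 in document 0 under the model.
joint_00 = plsa_joint(p_z, p_d_given_z, p_w_given_z, 0, 0)
```

Every parameter here is a double or a vector of doubles indexed by document,
term or category.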

All these models would call for doubles, or vectors of doubles, to be 
associated with Documents, TermIterators and Databases.

> > For a less important and fundamental suggestion, I'd like to mention that
> > in research, it is often important to have unique and determined
> > identifiers (strings) for documents. I have seen this done by using
> > prefixed terms (which is not very clean) or by using the "data" field of
> > documents (which lacks an iterator: one cannot jump to one particular
> > document easily this way). It might be interesting to do something on
> > this level (maybe simply by wrapping the "prefixed term" way into
> > something cleaner).
>
> What do you have in mind?  You can already add/replace or delete a
> document by term.  An overloaded version of get_document() which could
> retrieve the first document matching a particular term would be fairly
> easy to add and might save some internal work over creating a
> PostingIterator.

I was thinking of the toolchain of a scientist working on TREC, for instance: 
documents identified by string docIds are indexed, retrieval is applied, then 
the programme outputs a codified list of documents (the string docIds) which 
is used for evaluation with trec_eval.
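
As an illustration of that last step, here is a minimal sketch of writing
ranked results in the run format trec_eval reads (query id, the literal
"Q0", string docId, rank, score, run tag). The function name, the run tag
and the docIds below are hypothetical, and the ranked list is assumed to
have been produced by the retrieval step already.

```python
def format_trec_run(query_id, ranked, run_tag="xapian-run"):
    """Return trec_eval run lines for one query.

    `ranked` is a list of (string docId, score) pairs, best match first.
    """
    lines = []
    for rank, (docno, score) in enumerate(ranked, start=1):
        # Columns: query id, literal Q0, docId, rank, score, run tag.
        lines.append(f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical ranked list for query 401 (made-up docIds and scores).
run_lines = format_trec_run(
    "401",
    [("DOC-0001", 12.5), ("DOC-0042", 9.25)],
    run_tag="demo",
)
```

Writing these lines to a file gives a run that can be passed to trec_eval
together with the relevance judgements.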

Presently I store the string docIds in the "data" field, so there would be no 
elegant way for me to retrieve a document given its string docId. But I have 
not felt the need for this yet, so it's a nicety, really.

Cheers!

-- 
Emmanuel Eckard                              
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse                     
+41 21 693 66 97       

()  ascii ribbon campaign - against html mail 
/\                        - against microsoft attachments       


