[Xapian-devel] GSOC : Language Modelling for information retrieval with Diversified Search results

Mon Mar 26 03:35:49 BST 2012

On Sun, Mar 25, 2012 at 01:22:50AM +0530, Gaurav Arora wrote:
> On Thu, Mar 22, 2012 at 3:04 PM, Olly Betts <olly at survex.com> wrote:
> 
> > On Thu, Mar 22, 2012 at 06:40:46AM +0530, Gaurav Arora wrote:
> > > Language Modelling for Information retrieval approach  focus on building
> > > probabilistic language models for documents and rank document based on
> > > probability of model generating the query.Technique is heavy and costlier
> > > than the traditional information retrieval technique but has proved to
> > > preform better in literature than traditional methods.
> > >
> > > Language modelling approach performs better as it tries to capture word
> > and
> > > phrase association to capture user context.
> >
> > How well does it fit with Xapian's concept of what a weighting scheme is
> > though?
>
> Language Modelling  deviates a bit from the xapian concept of weighting
> scheme.xapian considers weighting of document as *sum of scores* for
> individual query keywords( score of terms taken from get_sumpart() from
> weight class) added with extra part of score(taken from get_extrapart() of
> weight class). whereas language modelling considers weighting of document
> as *product of probability* of term occurring given the language model of
> document.

If you take the logarithm (which is a monotonic function, so the
ordering is unaffected) then you can convert the product to a sum
(since the logarithm of a product is the sum of the logarithms of
the things being multiplied together).  This trick is actually
part of the derivation of the probabilistic weighting formula.

So as long as the product only includes probabilities for terms which
actually occur in the particular document being considered, then this
isn't a problem at all.

> > The scale of a project to implement this would be hugely different if
> > you are essentially implementing a Xapian::Weight subclass, or
> > implementing a whole new matcher, and possibly also new backend
> > data-structures.  I've not look closely enough at LM to know which
> > would be the case.
>
> As per requirements of LM and available framework of xapian,inclusion
> of LM to xapian will require
> 
>    1.) implementation of Xapain::Weight subclass for Language Model .
>    2.) Modification of Matcher or new matcher  since LM requires to weight
> by product of probability of term given with application of smoothing
> factors using collection frequency of words and  normalized document length
> (*Dirichlet* *smoothing*) instead of xapian current implementation which
> sums individual contribution for terms.

I'm unclear if (2) is actually required, or can be avoided with the
logarithm trick.  But reimplementing the matcher is a lot of work to do
well.

> LM requires collection frequency for the words while doing smoothing, i
> checked using xapian-delve tool these statistics are maintained by the
> backend data-structures.

Yes, collection frequency is stored.

> it will save implementation of language  model from adding any
> data-structure to the backend.

Parth seemed to think you'd probably want co-occurrence data for terms
for LM, which isn't something we currently track (and is probably rather
big, so a way to store it compactly would be quite important; it does at
least potentially have other uses).  Or are you proposing to only look
at unigram models?

> Do mentors think it  will still be useful to include LM in xapian with
> amount of work specified by modifying matcher and implementing weight
> subclass?

I think it would be a useful addition, but I think it would be good to
pin down what models you're wanting to tackle, and what needs changing
to support them, at least at the conceptual level.

> Which approach would  best fit and can be used to implement search result
> diversification in xapian ?

I've looked into this a bit before (I went to the "Diversity in Document
Retrieval" workshop at ECIR 2011) and my thinking was that the approach
which would fit best (or at least most easily) with Xapian was one which
reranked the result set obtained from the query (such as MMR).

A framework which supported reordering after the search would also be
useful for other operations (such as clustering of results, and indeed
the Learning to Rank work can be thought of as a reranking).

Reformulating the query would also not be hard to fit in, but seems (at
least potentially) to suffer from the problem that one type of result
may still dominate.

Trying to use external search engines to expand the query is OK for a
research project, but is likely to result in your IP address getting
blocked by them for abusive behaviour if used in a busy real world
system.  Sending queries to third parties also has privacy implications.

Approaches which need a taxonomy have the disadvantage of ... well,
needing a taxonomy, so unless they're significantly better, ones which
don't see a preferable option.

> Is it a good idea to use resource hungry techniques like using taxonomy
> information for search diversification in xapian .

I guess it depends on just how resource hungry you are talking.  The
more resources the techniques need, the fewer people would actually be
able to deploy them.  And if it takes seconds (or longer) to run a
diversified search, that's not really useful for interactive search,
which is by far the most common use.

Cheers,
    Olly