[Xapian-devel] GSOC : Language Modelling for information retrieval with Diversified Search results

Thu Mar 22 09:34:45 GMT 2012

On Thu, Mar 22, 2012 at 06:40:46AM +0530, Gaurav Arora wrote:
> Language Modelling approaches to Information Retrieval focus on building
> probabilistic language models for documents and rank documents based on
> the probability of the model generating the query. The technique is heavier
> and costlier than traditional information retrieval techniques but has been
> shown in the literature to perform better than traditional methods.
> 
> The language modelling approach performs better as it tries to capture word
> and phrase associations to capture user context.

How well does it fit with Xapian's concept of what a weighting scheme is
though?

The scale of a project to implement this would be hugely different if
you are essentially implementing a Xapian::Weight subclass, or
implementing a whole new matcher, and possibly also new backend
data-structures.  I've not looked closely enough at LM to know which
would be the case.
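
As a rough illustration of what "fitting" might mean: assuming a unigram
query-likelihood model with Dirichlet smoothing (the proposal does not say
which LM variant is intended), the score splits into a sum over matching
query terms, a part that depends only on document length, and a constant
that doesn't affect ranking.  That is at least the same shape as the
per-term and per-document contributions a Xapian::Weight subclass supplies
via get_sumpart() and get_sumextra(), though whether it maps cleanly in
practice is not established here.  The sketch below is plain Python, not
Xapian code, and every name in it is hypothetical:

```python
import math

def dirichlet_score(query, doc_tf, doc_len, coll_prob, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed unigram model.

    query: list of terms; doc_tf: term -> frequency in the document;
    coll_prob: term -> probability of the term in the whole collection.
    """
    return sum(
        math.log((doc_tf.get(t, 0) + mu * coll_prob[t]) / (doc_len + mu))
        for t in query
    )

def decomposed_score(query, doc_tf, doc_len, coll_prob, mu=2000.0):
    """The same score, split into the three parts described above."""
    # Per-term part: non-zero only for query terms the document contains.
    per_term = sum(
        math.log(1.0 + doc_tf[t] / (mu * coll_prob[t]))
        for t in query if t in doc_tf
    )
    # Per-document part: depends only on document length and query length.
    per_doc = len(query) * math.log(mu / (doc_len + mu))
    # Constant for a given query: identical for every document.
    constant = sum(math.log(coll_prob[t]) for t in query)
    return per_term + per_doc + constant
```

Note that the per-document part is negative, so some shifting or
normalisation would presumably be needed in a framework which expects
non-negative weights.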

BTW, the very next page after the one you link to says:

    Nevertheless, there is perhaps still insufficient evidence that its
    performance so greatly exceeds that of a well-tuned traditional
    vector space retrieval system as to justify changing an existing
    implementation.

http://nlp.stanford.edu/IR-book/html/htmledition/language-modeling-versus-other-approaches-in-ir-1.html

Both DfR and Learning to Rank are claimed to outperform BM25 and vector
space models too - do you know how LM compares to these?  I don't recall
seeing any such comparisons myself.

> 1. Xapian supports relevance feedback (query expansion) through the
> "Xapian::Enquire::get_eset" function. Which algorithm is used to expand
> the query in the Enquire class?

The probabilistic one from the Robertson-Spärck Jones paper.
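
For reference, the relevance weight from that paper is usually written with
0.5 added to each count.  The sketch below computes it in plain Python and
ranks candidate expansion terms by r times the weight (Robertson's "offer
weight"); that selection rule and all the names are assumptions for
illustration, not a description of Xapian's exact implementation:

```python
import math

def rsj_weight(N, n, R, r):
    """Robertson/Spärck Jones relevance weight with 0.5 smoothing.

    N: documents in the collection, n: documents containing the term,
    R: documents marked relevant, r: relevant documents containing the term.
    Assumes r <= R, r <= n and n - r <= N - R.
    """
    return math.log(
        ((r + 0.5) * (N - n - R + r + 0.5))
        / ((n - r + 0.5) * (R - r + 0.5))
    )

def rank_expansion_terms(N, R, term_stats):
    """Order candidate terms by r * weight, best first.

    term_stats: term -> (n, r) as defined in rsj_weight().
    """
    def offer_weight(term):
        n, r = term_stats[term]
        return r * rsj_weight(N, n, R, r)
    return sorted(term_stats, key=offer_weight, reverse=True)
```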

> Search result diversification in its naive form is performed by
> expanding the query with different contexts and adding documents from each
> context to the final rank-list, thereby catering to all contexts of the
> query.
> 
> I was thinking I could use the algorithm implemented for the expand set for
> query expansion and implement a new algorithm for search diversification;
> this way the query expansion feature of Xapian would also become more
> powerful.

Possibly, but maybe it would be better to use an approach from the
literature which has already been tried and evaluated?
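
One example of such an approach is Maximal Marginal Relevance (MMR), which
greedily picks the document that best trades off relevance against
similarity to what has already been picked.  This is named only as an
illustration, not as the method proposed in this thread.  A minimal sketch,
assuming relevance scores are already available and using Jaccard overlap
of term sets as a stand-in similarity measure (all names are hypothetical):

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mmr_rerank(candidates, k, lam=0.5):
    """Greedy MMR re-ranking.

    candidates: list of (doc_id, relevance, terms) tuples.
    lam: 1.0 means pure relevance, 0.0 means pure diversity.
    Returns the doc_ids of up to k selected documents, in selection order.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr(candidate):
            _, relevance, terms = candidate
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (jaccard(terms, s[2]) for s in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        remaining.remove(best)
        selected.append(best)
    return [doc_id for doc_id, _, _ in selected]
```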

> 2. I have read that Xapian supports passage retrieval, proximity-based
> queries and wildcard queries, but I could not find any documentation or
> functions providing these facilities in Xapian. I would be glad if you
> could point me towards any available documentation describing how to use
> such options.

I don't think we claim to support passage retrieval anywhere (I suppose
you could implement it by breaking large documents up into sections in
a second database and performing a second search within those).
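
A minimal sketch of the splitting step, assuming fixed-size overlapping
word windows rather than real document sections (the window sizes and all
names are hypothetical).  Each passage keeps the parent document's id so
that hits in the second database can be mapped back:

```python
def split_into_passages(doc_id, text, size=100, overlap=20):
    """Split text into overlapping word windows.

    Returns a list of (doc_id, passage_number, passage_text) tuples, each
    of which could be indexed as its own document in a second database.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    passages = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        passages.append((doc_id, len(passages), " ".join(chunk)))
        # Stop once this window has reached the end of the text.
        if start + size >= len(words):
            break
    return passages
```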

Proximity and wildcards are both supported - you just aren't looking
very hard, for example:

http://xapian.org/search?P=wildcard

> I would be glad if mentors from the Xapian community could comment on my
> idea of implementing language modelling techniques and search result
> diversification as a project for an open source search engine library
> (Xapian). Will implementing these techniques help Xapian as an open source
> project?

Diversification of results is certainly something people have asked
about before - it would be useful in a lot of applications I think.

Language Modelling is interesting.  I think if it can be fitted into the
framework we already have it would be worthwhile to implement it.  I'm
not so sure it would be if it required a second matcher, or even more, to
be implemented.  That would be a lot of extra code to maintain.

Cheers,
    Olly