[Xapian-devel] GSOC : Language Modelling for information retrieval with Diversified Search results

Gaurav Arora gauravarora.daiict at gmail.com
Sat Mar 24 19:52:50 GMT 2012


On Thu, Mar 22, 2012 at 3:04 PM, Olly Betts <olly at survex.com> wrote:

> On Thu, Mar 22, 2012 at 06:40:46AM +0530, Gaurav Arora wrote:
> > The Language Modelling approach to Information Retrieval focuses on
> > building probabilistic language models for documents and ranking each
> > document by the probability of its model generating the query. The
> > technique is heavier and costlier than traditional information retrieval
> > techniques, but has been shown in the literature to perform better than
> > traditional methods.
> >
> > The language modelling approach performs better because it tries to
> > capture word and phrase associations, and so captures user context.
>
> How well does it fit with Xapian's concept of what a weighting scheme is
> though?
>
  Language Modelling deviates a bit from Xapian's concept of a weighting
scheme. Xapian computes a document's weight as a *sum of scores* for the
individual query terms (per-term scores from get_sumpart() in the Weight
class) plus an extra term-independent part (from get_sumextra()), whereas
Language Modelling weights a document as the *product of probabilities* of
each query term being generated by the document's language model.
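One detail worth noting (a sketch with made-up numbers, not Xapian API
code): since log is monotonic, ranking by the product of per-term
probabilities gives exactly the same ordering as ranking by the sum of
their logs, which has the same additive per-term shape as Xapian's
get_sumpart() contributions:

```python
import math

# Hypothetical per-term probabilities P(t|d) for a 3-term query.
probs = [0.02, 0.005, 0.01]

product_score = math.prod(probs)               # LM's product of probabilities
sum_of_logs = sum(math.log(p) for p in probs)  # additive, per-term form

# log is monotonic, so both forms rank documents identically:
assert math.isclose(math.log(product_score), sum_of_logs)
```

This is why the mismatch with the sum-based matcher may be smaller than it
first appears.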


> The scale of a project to implement this would be hugely different if
> you are essentially implementing a Xapian::Weight subclass, or
> implementing a whole new matcher, and possibly also new backend
> data-structures.  I've not look closely enough at LM to know which
> would be the case.
>
   As per the requirements of LM and the framework Xapian already provides,
adding LM to Xapian will require:

   1.) Implementation of a Xapian::Weight subclass for the Language Model.
   2.) Modification of the matcher (or a new matcher), since LM weights a
document by the product of term probabilities, smoothed using the
collection frequency of terms and the normalised document length
(*Dirichlet* *smoothing*), instead of Xapian's current approach of summing
the individual per-term contributions.

LM needs the collection frequency of each term while smoothing. I checked
with the xapian-delve tool that these statistics are already maintained by
the backend data structures, which will save the language model
implementation from adding any new data structures to the backend.
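For reference, the Dirichlet-smoothed estimate these statistics would feed
into can be sketched like this (the function name and the mu default are my
own; this is the textbook formula, not Xapian code):

```python
def p_term_given_doc(tf, doc_len, coll_freq, coll_len, mu=2000.0):
    """Dirichlet-smoothed P(t|d).

    tf        -- term frequency of t within the document
    doc_len   -- document length in term occurrences
    coll_freq -- collection frequency of t (as reported by xapian-delve)
    coll_len  -- total number of term occurrences in the collection
    mu        -- Dirichlet prior parameter (2000 is a common default)
    """
    p_coll = coll_freq / coll_len          # background collection model P(t|C)
    return (tf + mu * p_coll) / (doc_len + mu)
```

Because the background model keeps the estimate non-zero even when tf is 0,
every query term contributes a finite log-probability.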


> BTW, on the very next page to the one you link to says:
>
>    Nevertheless, there is perhaps still insufficient evidence that its
>    performance so greatly exceeds that of a well-tuned traditional
>    vector space retrieval system as to justify changing an existing
>    implementation.
>
>
> http://nlp.stanford.edu/IR-book/html/htmledition/language-modeling-versus-other-approaches-in-ir-1.html
>
> Both DfR and Learning to Rank are claimed to outperform BM25 and vector
> space models too - do you know how LM compares to these?  I don't recall
> seeing any such comparisons myself.
>
>
LM claims to outperform both BM25 and the vector space model when used with
KL divergence and Dirichlet smoothing; if LM is used without them, BM25
outperforms LM.

A comparative study showing LM outperforming the BM25 model is presented in
http://dl.acm.org/citation.cfm?id=1378323 (page 6): a unigram language
model based on Kullback-Leibler divergence, with Dirichlet prior smoothing,
gives the best results and outperforms these models.
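For illustration, the KL-divergence criterion reduces (after dropping the
query-entropy term, which is constant per query) to a weighted sum of log
document-model probabilities; a sketch with hypothetical names:

```python
import math

def neg_kl_score(query_tf, doc_probs):
    """Rank score -KL(query model || document model), with the
    query-entropy term dropped since it is constant per query.

    query_tf  -- dict term -> frequency of the term in the query
    doc_probs -- dict term -> smoothed P(term|document), all non-zero
    """
    q_len = sum(query_tf.values())
    return sum((tf / q_len) * math.log(doc_probs[t])
               for t, tf in query_tf.items())
```

With a uniform single-term query this collapses to the plain query
likelihood, so KL-divergence retrieval generalises it rather than replacing
it.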

On the other hand, a comparison in
http://dl.acm.org/citation.cfm?id=1813934 shows DFR outperforming LM. I
could not find any comparison between LM and Learning to Rank.

Clinchant et al., in Bridging Language Modeling and Divergence from
Randomness Models (http://dl.acm.org/citation.cfm?id=1612342), propose a
log-logistic model which bridges the gap between LM and DFR; it has been
shown to perform comparably to the language model, and to outperform it on
some standard document collections.

I would like to share one conjecture from the initial LM paper by Bruce
Croft:

LM is simple enough to be explained to a user at an intuitive level, and
understanding it will help users form better queries. A user who
understands the system at an intuitive level will tend to think in terms of
words which help the system distinguish the documents of interest from
everything else.

This intuitive behaviour of LM, combined with performance comparable to
BM25, might prove beneficial when Xapian is used by a real set of users to
search *gmane*, thanks to a better intuitive understanding of the system,
which is difficult to attain with other models due to the complex
heuristics they rely on.

Do mentors think it would still be useful to include LM in Xapian, given
the amount of work outlined above in modifying the matcher and implementing
the Weight subclass?


> > Since search result diversification in its naive form is performed by
> > expanding the query with different contexts and adding documents from
> > different contexts to the final rank-list, thereby catering to all
> > contexts of the query.
> >
> > I was thinking if i can use the algorithm implemented in expanded set for
> > query expansion and implement a new algorithm in Search diversification
> in
> > this way query expansion feature of xapian will also get powerful.
>
> Possibly, but maybe it would be better to use an approach from the
> literature which has already been tried and evaluated?
>
   Yes, sorry for the vague wording in my description: by "new algorithm" I
meant an algorithm not yet implemented in Xapian, but already tried and
evaluated in the literature.

I did a little survey of search result diversification. I found the
algorithms categorised into two families, implicit and explicit. Explicit
algorithms use a taxonomy or other categorisation knowledge, such as the
taxonomy available from http://www.dmoz.org, whereas implicit algorithms
avoid any such information and rely only on the queries, the document
content, and collection statistics.

I found most algorithms using external resources, such as category
information from the Open Directory (http://www.dmoz.org), query
reformulations from commercial search engines, associations computed on a
third-party corpus, or the query logs of a search engine. The best approach
was one which used query reformulations from commercial search engines.

Approach name: MMR
Approach description: Jaime G. Carbonell et al. introduced MMR (Maximal
Marginal Relevance) for search result diversification. MMR-based
diversity reranking, used for reordering documents and producing
summaries, does not use any external resources and has shown considerably
good results without them.

Approach name: Q-Filter
Approach description: Radlinski et al. proposed using the query logs of a
search engine to form sub-queries of the given query, in order to
disambiguate user intent, and then a simple approach that adds documents
for all the formed sub-queries to the rank list.
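The merging step could be as simple as a round-robin interleave of the
per-sub-query rank lists; a sketch (my assumption of what a "simple
approach" might look like, not necessarily Radlinski et al.'s exact
procedure):

```python
from itertools import zip_longest

def interleave(ranklists):
    """Round-robin merge of per-sub-query rank lists, skipping duplicate
    documents, so every sub-query (user intent) gets early coverage."""
    seen = set()
    merged = []
    for rank in zip_longest(*ranklists):
        for doc in rank:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```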

Approach name: IA-Select
Approach description: Agrawal et al. proposed using a taxonomy over
documents and queries, adding documents to the rank list with a greedy
algorithm based on the quality of the document and its category. Due to
its greedy nature, the algorithm ensures every category (i.e. user intent)
is served, and it does not unnecessarily add a document once good documents
for its category have already been added, according to the utility function
they define.
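A simplified sketch of the greedy selection with decaying category utility
(a rough reading of IA-Select; the names and structure are mine, and the
paper's actual utility function is richer):

```python
def ia_select(candidates, categories, p_cat, quality, k):
    """Greedy IA-Select sketch (after Agrawal et al., simplified).

    candidates -- list of doc ids
    categories -- dict doc -> set of category labels
    p_cat      -- dict category -> P(category | query)
    quality    -- dict (doc, category) -> value V(d | q, c) in [0, 1]
    """
    # U(c) starts at P(c|q) and shrinks as docs covering c are chosen,
    # so already-served intents contribute less to later picks.
    U = dict(p_cat)
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(d):
            return sum(U.get(c, 0.0) * quality.get((d, c), 0.0)
                       for c in categories[d])
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        for c in categories[best]:          # discount covered categories
            U[c] = U.get(c, 0.0) * (1.0 - quality.get((best, c), 0.0))
    return selected
```

The discount step is what stops the greedy loop from stacking several good
documents for the same dominant intent.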

Approach name: xQuAD
Approach description: The xQuAD framework exploits query reformulations
provided by three major Web search engines (WSEs) as a means to uncover
different query aspects.

We could use the Q-Filter and xQuAD approaches with associations computed
on a third-party corpus, using any available standard corpus.

Please find attached an image comparing the proposed approaches.

Which approach would fit best, and could be used to implement search result
diversification in Xapian?

Is it a good idea to use resource-hungry techniques, such as taxonomy
information, for search diversification in Xapian?

References:
http://dl.acm.org/citation.cfm?id=1526761
http://dl.acm.org/citation.cfm?id=1498766
http://dl.acm.org/citation.cfm?id=1772780

>
> > I would be glad if mentors from xapian community can comment on my idea
> of
> > implementing Language modelling technique and search result
> diversification
> > as a project in scenario of Open Source Search Engine Library( xapian).
> > Will implementing these techniques help xapian as a open source project?
>
> Diversification of results is certainly something people have asked
> about before - it would be useful in a lot of applications I think.
>
> Language Modelling is interesting.  I think if it can be fitted into the
> framework we already have it would be worthwhile to implement it.  I'm
> not so sure if it would require a second matcher or even more to be
> implemented.  That would be a lot of extra code to maintain.
>
> Cheers,
>    Olly



-- 
with regards
Gaurav A.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: results_search_diversification.png
Type: image/png
Size: 38804 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120325/8bf2504a/attachment-0001.png>


More information about the Xapian-devel mailing list