<div class="gmail_quote">On Thu, Mar 22, 2012 at 3:04 PM, Olly Betts <span dir="ltr">&lt;<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div>On Thu, Mar 22, 2012 at 06:40:46AM +0530, Gaurav Arora wrote:<br>

&gt; Language Modelling for Information retrieval approach  focus on building<br>

&gt; probabilistic language models for documents and rank document based on<br>

&gt; probability of model generating the query.Technique is heavy and costlier<br>

&gt; than the traditional information retrieval technique but has proved to<br>

&gt; preform better in literature than traditional methods.<br>

&gt;<br>

&gt; Language modelling approach performs better as it tries to capture word and<br>

&gt; phrase association to capture user context.<br>

<br>

</div>How well does it fit with Xapian&#39;s concept of what a weighting scheme is<br>

though?<br>

<br></blockquote><div>  Language Modelling  deviates a bit from the xapian concept of weighting scheme.xapian considers weighting of document as <b>sum of scores</b> for individual query keywords( score of terms taken from get_sumpart() from weight class) added with extra part of score(taken from get_extrapart() of weight class). whereas language modelling considers weighting of document as <b>product of probability</b> of term occurring given the language model of document. </div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The scale of a project to implement this would be hugely different if<br>

you are essentially implementing a Xapian::Weight subclass, or<br>

implementing a whole new matcher, and possibly also new backend<br>

data-structures.  I&#39;ve not look closely enough at LM to know which<br>

would be the case.<br>

<br></blockquote><div>   As per requirements of LM and available framework of xapian,inclusion of LM to xapian will require </div><div><br></div><div>   1.) implementation of Xapain::Weight subclass for Language Model .</div>


<div>   2.) Modification of Matcher or new matcher  since LM requires to weight by product of probability of term given with application of smoothing factors using collection frequency of words and  normalized document length (<em style="line-height:16px;font-style:normal;font-family:arial,sans-serif;font-weight:bold">Dirichlet</em><span style="line-height:16px;color:rgb(34,34,34);font-family:arial,sans-serif"> </span><em style="line-height:16px;font-style:normal;font-family:arial,sans-serif;font-weight:bold">smoothing</em>) instead of xapian current implementation which sums individual contribution for terms.</div>


<div> </div><div>LM requires collection frequency for the words while doing smoothing, i checked using xapian-delve tool these statistics are maintained by the backend <a href="http://data-structures.it" target="_blank">data-structures. it</a> will save implementation of language  model from adding any data-structure to the backend.</div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

BTW, on the very next page to the one you link to says:<br>

<br>

    Nevertheless, there is perhaps still insufficient evidence that its<br>

    performance so greatly exceeds that of a well-tuned traditional<br>

    vector space retrieval system as to justify changing an existing<br>

    implementation.<br>

<br>

<a href="http://nlp.stanford.edu/IR-book/html/htmledition/language-modeling-versus-other-approaches-in-ir-1.html" target="_blank">http://nlp.stanford.edu/IR-book/html/htmledition/language-modeling-versus-other-approaches-in-ir-1.html</a><br>


<br>

Both DfR and Learning to Rank are claimed to outperform BM25 and vector<br>

space models too - do you know how LM compares to these?  I don&#39;t recall<br>

seeing any such comparisons myself.<br>

<div><br></div></blockquote><div><br></div><div>LM cliams to outperform both BM25 and Vector space model when used with KL divergence and Dirichlet smoothing.otherwise if LM is used without KL divergence and Dirichlet smoothing BM25 outperforms LM.</div>

<div><br></div><div> A comparative study which proves LM outperform BM25 model is presented by <a href="http://dl.acm.org/citation.cfm?id=1378323" target="_blank">http://dl.acm.org/citation.cfm?id=1378323</a> on page 6 ,Unigram language model based on Kullback-Leibler divergence, using Dirichlet prior smoothing gives best results and outperforms these models.</div>


<div><br></div><div>whereas in one comparison in available in <a href="http://dl.acm.org/citation.cfm?id=1813934" target="_blank">http://dl.acm.org/citation.cfm?id=1813934</a> shows DFR outperform LM. I could not get any comparison between LM and Learning to rank.</div>


<div><br></div><div>Clinchant et. al. in Bridging Language Modeling and Divergence from Randomness Models ( <a href="http://dl.acm.org/citation.cfm?id=1612342" target="_blank">http://dl.acm.org/citation.cfm?id=1612342</a> ) propose a approach which presents a log-logistic model which bridges gap between LM and DFR and have proved to be performing either comparatively and also outperform Language model on some standard document collection.</div>


<div><br></div><div>I would like to share one conjecture from initial paper of LM by Bruce croft:</div><div><br></div><div>LM is simple enough to be explained to a user at intuitive level and understanding of it will facilitate information in forming better queries.A user that understand system at intuitive level will tend to think in-terms of word which will help system distinguish document of interest from everything else.</div>


<div><br></div><div>this intuitive behavior of LM with comparable performance of BM25 might prove beneficial when xapian is used as application by real set of users to search <b>gmane</b> due to better intuitive understanding of system which is difficult to attain in other models due to inherent complex heuristics used.</div>


<div><br></div><div>Do mentors think it  will still be useful to include LM in xapian with amount of work specified by modifying matcher and implementing weight subclass?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><br>

&gt; Since search result diversification is its naive form performed by<br>

&gt; expanding query with different context and adding document from different<br>

&gt; context in final rank-list, thereby catering to all context of query.<br>

&gt;<br>

&gt; I was thinking if i can use the algorithm implemented in expanded set for<br>

&gt; query expansion and implement a new algorithm in Search diversification in<br>

&gt; this way query expansion feature of xapian will also get powerful.<br>

<br>

</div>Possibly, but maybe it would be better to use an approach from the<br>

literature which has already been tried and evaluated?<br>

<div><br></div></blockquote><div>  yes,sorry for vague term in my description,with new algorithm i meant algorithm not implemented in xapian but reported to be tried and evaluated in literature.</div><div><br>

</div><div>I did a little survey on the Diversification of Search results.I found algorithm categorized in two category viz. implicit and explicit.</div><div>Implicit algorithm use taxonomy or other categorization information knowledge available like taxonomy available by <a href="http://www.dmoz.org/" target="_blank">http://www.dmoz.org</a> . whereas explicit algorithm avoid using any such information and relies on queries and documents content and statistics only.</div>


<div><br></div><div>i found most of algorithm using external resources like category information available from open directory(<a href="http://www.dmoz.org/" target="_blank">http://www.dmoz.org</a>) , query reformulation available from commercial search engine, association on third party corpus,query logs of a search engine . Best approach was one which used query reformulation from commercial search engines. </div>


<div><br></div><div>Approach name: MMR</div><div>Approach description: Jaime G. Carbonell et. al. introduced MMR in search result diversification . The use of MMR, diversity-based reranking for reordering documents and producing summaries don&#39;t use any external resource and have shown considerable good results without resources.</div>


<div><br></div><div><div>Approach name: Q-Filter</div><div>Approach description: Radlinski et. al. proposed using query logs of search engine to form sub queries of given query to disambiguate user intent.Then using simple approach to add documents for all formed sub queries in rank list.</div>


</div><div><br></div><div><div>Approach name: IA-Select</div><div>Approach description: Agrawal et. al. proposed using taxonomy for documents and query.Adding documents to the ranklist using greedy algorithm based on quality of document and category.algorithm due to greedy nature ensures all category(or user intent) will be served and don&#39;t unnecessary add document if good document for a category have been added based on utility function defined by them.</div>


</div><div><br></div><div>Approach name: Quad </div><div>Approach description: Quad framework exploit query reformulations provided  by three major Web search engines (WSEs) as a means to uncover different query aspects.</div>


<div><br></div><div>We can use Q-Filter and Quad approach with association calculated on third party corpus available on any standard corpus.</div><div><br></div><div>PFA attached image for comparison of proposed approach.</div>

<div><br></div>

<div><div>Which approach would  best fit and can be used to implement search result diversification in xapian ?</div></div><div><br></div><div>Is it a good idea to use resource hungry techniques like using taxonomy information for search diversification in xapian .</div>


<div><br></div><div>References:</div><div><a href="http://dl.acm.org/citation.cfm?id=1526761" target="_blank">http://dl.acm.org/citation.cfm?id=1526761</a></div><div><a href="http://dl.acm.org/citation.cfm?id=1498766" target="_blank">http://dl.acm.org/citation.cfm?id=1498766</a></div>


<div><a href="http://dl.acm.org/citation.cfm?id=1772780" target="_blank">http://dl.acm.org/citation.cfm?id=1772780</a></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div>

<br>

&gt; I would be glad if mentors from xapian community can comment on my idea of<br>

&gt; implementing Language modelling technique and search result diversification<br>

&gt; as a project in scenario of Open Source Search Engine Library( xapian).<br>

&gt; Will implementing these techniques help xapian as a open source project?<br>

<br>

</div>Diversification of results is certainly something people have asked<br>

about before - it would be useful in a lot of applications I think.<br><br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Language Modelling is interesting.  I think if it can be fitted into the<br>

framework we already have it would be worthwhile to implement it.  I&#39;m<br>

not so sure if it would require a second matcher or even more to be<br>

implemented.  That would be a lot of extra code to maintain.<br><br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Cheers,<br>

    Olly</blockquote></div><br clear="all"><div><br></div>-- <br>with regards<br>Gaurav A.<br>