Hi,<div><br></div><div>I have implemented an initial prototype of a Xapian::Weight subclass for unigram language modelling, to support UnigramLM weighting in Xapian. Other changes include adding collection_frequency to the TermFreqs struct to store the collection frequency of terms, some changes to the Xapian framework to support it, and changing simplesearch.cc to search using the UnigramLMWeight class.</div>
<div><br></div><div>The following issues have not been addressed in this patch (I am working on them):</div><div><br></div><div>1. The log trick for handling multiplication in the LM needs to be made more robust than just adding an arbitrary number to avoid rejecting a document because log returns a negative value.</div>
<div><br></div><div>Each term's contribution is a probability (between 0 and 1), so taking its log gives a negative value, which eventually causes the document to be rejected. For now an arbitrary linear weight has been added to work around this. It needs to be addressed properly by using a different log base or some other technique.</div>
<div><br></div><div>The discussion about which log trick to use is here for reference: <a href="http://comments.gmane.org/gmane.comp.search.xapian.devel/1857">http://comments.gmane.org/gmane.comp.search.xapian.devel/1857</a></div>
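<div><br></div><div>To illustrate the problem described above, here is a minimal standalone sketch of one possible way to keep the log weights non-negative. It is not the code in the patch and does not use the Xapian API. It assumes Jelinek-Mercer smoothing, and the names smoothed_prob, term_weight, max_term_weight and the parameter lambda are all hypothetical. Instead of adding an arbitrary constant, each probability is divided by the smallest value it can take (the collection-only part), so the log is never negative and is zero for a document that does not contain the term:</div>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of Xapian): Jelinek-Mercer smoothed P(t|d).
// wdf      = occurrences of the term in the document
// doclen   = length of the document
// collfreq = occurrences of the term in the whole collection
// colllen  = total length of the collection
// lambda   = smoothing parameter, assumed to be in (0, 1)
double smoothed_prob(double wdf, double doclen,
                     double collfreq, double colllen, double lambda) {
    assert(lambda > 0.0 && lambda < 1.0);
    assert(collfreq > 0.0 && colllen > 0.0);
    double p_doc = doclen > 0.0 ? wdf / doclen : 0.0;
    double p_coll = collfreq / colllen;
    return (1.0 - lambda) * p_doc + lambda * p_coll;
}

// Per-term weight: log of the smoothed probability scaled by its lower
// bound lambda * P(t|C). The ratio is always >= 1, so the log is >= 0.
double term_weight(double wdf, double doclen,
                   double collfreq, double colllen, double lambda) {
    double p = smoothed_prob(wdf, doclen, collfreq, colllen, lambda);
    double p_min = lambda * (collfreq / colllen);
    return std::log(p / p_min);
}

// Upper bound on term_weight over all documents, reached when the whole
// document consists of the term (wdf == doclen).
double max_term_weight(double collfreq, double colllen, double lambda) {
    assert(lambda > 0.0 && lambda < 1.0);
    assert(collfreq > 0.0 && colllen > 0.0);
    double p_min = lambda * (collfreq / colllen);
    return std::log(((1.0 - lambda) + p_min) / p_min);
}
```

<div>Under these assumptions, scaling by the lower bound shifts every document's score for a given query term by the same amount, so the ranking is the same as with the raw log probabilities. The same quantity also gives an upper bound of the kind get_maxpart() needs, although a real implementation could likely tighten it using document length statistics.</div>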
<div><br clear="all"><div>2. Setting a tighter bound for get_maxpart() to make the matching process more efficient.</div><div><br></div><div>3. Adding other smoothing methods to the UnigramLMWeight implementation.</div><div>
<br></div><div><br></div><div>Please find attached 5 patches for the initial prototype implementation of the Unigram Language Model in Xapian.</div><div><br></div><div>Thanks,</div><div><br></div>-- <br>with regards<br>Gaurav A.<br>
</div>