[Xapian-devel] Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.

Gaurav Arora gauravarora.daiict at gmail.com
Sun Apr 15 02:09:33 BST 2012


Hi,

  I have implemented an initial prototype of a Xapian::Weight subclass for
Unigram Language Modelling, to support UnigramLM weighting in Xapian. Other
changes include adding collection_frequency to the TermFreqs struct to store
the collection frequency of terms, some changes to the Xapian framework to
support it, and changing simplesearch.cc to search using the UnigramLMWeight
class.
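
For readers wondering why the collection frequency is needed: a smoothed
unigram model typically mixes the term's probability in the document with its
probability in the whole collection, so that a query term missing from a
document does not force the score to zero. The sketch below uses
Jelinek-Mercer smoothing purely as an illustration; the function name, the
parameter names and the choice of smoothing are assumptions and are not taken
from the patch.

```cpp
#include <cassert>

// Hypothetical helper, not part of the patch: Jelinek-Mercer smoothed
// unigram probability of a term given a document.
//   wdf         : occurrences of the term in the document
//   doc_length  : total number of terms in the document
//   coll_freq   : occurrences of the term in the whole collection
//   coll_length : total number of terms in the collection
//   lambda      : assumed mixing parameter in [0, 1]
double smoothed_term_probability(double wdf, double doc_length,
                                 double coll_freq, double coll_length,
                                 double lambda)
{
    assert(doc_length > 0.0 && coll_length > 0.0);
    assert(lambda >= 0.0 && lambda <= 1.0);
    double p_doc = wdf / doc_length;          // maximum-likelihood estimate
    double p_coll = coll_freq / coll_length;  // background (collection) model
    return (1.0 - lambda) * p_doc + lambda * p_coll;
}
```

With lambda > 0 and a term that occurs somewhere in the collection, the result
stays strictly positive even when wdf is 0.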

The following issues have not been addressed in this patch (I am working on
them):

1. The log trick for handling multiplication in the LM needs to be made more
robust than just adding an arbitrary number to avoid rejecting a document due
to the negative value returned by log.

     Each term's contribution is a probability (between 0 and 1), so taking
its log gives a negative value, which eventually leads to the document being
rejected. Hence an arbitrary linear weight has been added. This needs to be
addressed by using a different log base or some other technique.

The discussion about which log trick to use is here for reference:
http://comments.gmane.org/gmane.comp.search.xapian.devel/1857
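
To make the problem concrete, here is a minimal sketch of why summing logs of
probabilities yields a negative score, and of one possible way to shift it.
Adding a constant c to each term's log is the same as scaling each
probability by e^c before taking the log. The function names, the scale
parameter and the clamp at zero are assumptions for illustration only; this
is not the approach used in the patch.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of log-probabilities. This equals the log of the product of the
// probabilities and is never positive, since each p is in (0, 1].
double log_score(const std::vector<double>& term_probs)
{
    double sum = 0.0;
    for (double p : term_probs) {
        assert(p > 0.0 && p <= 1.0);
        sum += std::log(p);
    }
    return sum;
}

// Hypothetical shifted variant: log(p * scale) = log(p) + log(scale).
// Each term's part is non-negative whenever p >= 1 / scale; smaller values
// are clamped to 0 here, which is an assumption of this sketch.
double shifted_log_score(const std::vector<double>& term_probs, double scale)
{
    assert(scale > 0.0);
    double sum = 0.0;
    for (double p : term_probs) {
        assert(p > 0.0 && p <= 1.0);
        double part = std::log(p * scale);
        sum += part > 0.0 ? part : 0.0;
    }
    return sum;
}
```

As long as no term is clamped, every document matching the same number of
query terms is shifted by the same amount, so their relative order is
unchanged.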

2. Setting a tighter bound for get_maxpart() to make the matching process
more efficient.

3. Adding other smoothing factors to the UnigramLMWeight implementation.


Please find attached 5 patches for the initial prototype implementation of
the Unigram Language Model in Xapian.

Thanks,

-- 
with regards
Gaurav A.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Added-UnigramLMWeigh-to-the-Xapian-Weight-Subclass.c.patch
Type: application/octet-stream
Size: 19053 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Made-changes-to-remote-backend-class-to-accomodate-c.patch
Type: application/octet-stream
Size: 2392 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Adding-dependency-classunigramlmweight.Plo-for-unigramlmweight.cc.patch
Type: application/octet-stream
Size: 12685 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0004-Removed-a-implementation-bug-of-Collection-Frequency.patch
Type: application/octet-stream
Size: 3272 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0005-Minor-indentation-and-comment-changes-in-the-code.patch
Type: application/octet-stream
Size: 5735 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0009.obj>