[Xapian-devel] Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.

Olly Betts olly at survex.com
Tue Apr 17 03:36:48 BST 2012


On Sun, Apr 15, 2012 at 06:39:33AM +0530, Gaurav Arora wrote:
>   I have implemented an initial prototype of a Xapian::Weight subclass for
> Unigram Language Modelling to support UnigramLM weighting in Xapian.  Other
> changes include adding collection_frequency to the TermFreqs struct to store
> the collection frequency of terms, some changes to support it in the Xapian
> framework, and changing simplesearch.cc to search using the UnigramLMWeight
> class.
> 
> The following issues have not been addressed in this patch (I am working
> on them):
> 
> 1. The log trick for handling multiplication for LM needs to be made more
> robust than just adding some random number to avoid rejecting documents due
> to the negative value returned by log.

BTW, log() in C/C++ is natural logarithm (so base e), so 10 seems
particularly arbitrary to add.  Log to base 10 is log10().

I'm not sure what the best answer is here though.
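
To make the base-e point concrete, here is a minimal sketch (not code from
the patch).  The helper names log_likelihood and shifted_log are
hypothetical and are not part of Xapian.  It assumes per-term probabilities
in (0, 1], so the sum of their logs is never positive, and it shows that
adding a constant c to log(p) is the same as scaling p by e^c:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not a Xapian API): sum of natural logs of per-term
// probabilities, i.e. the log of their product.
double log_likelihood(const std::vector<double>& term_probs) {
    double sum = 0.0;
    for (double p : term_probs) {
        assert(p > 0.0 && p <= 1.0);  // assumption: valid, non-zero probabilities
        sum += std::log(p);           // std::log is the natural log (base e)
    }
    return sum;
}

// Hypothetical helper: log(p) + shift equals log(p * e^shift), so a fixed
// shift only keeps the result non-negative while p >= e^(-shift).
double shifted_log(double p, double shift) {
    return std::log(p) + shift;
}
```

With a shift of 10, any probability below e^-10 (roughly 4.5e-5) still
gives a negative value, so the constant only moves the cut-off rather than
removing it.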

> PFA 5 patches for the initial prototype implementation of Unigram Language
> Model in Xapian.

Thanks for the patches.  They look good, though I didn't try them out
yet.  Three minor things:

You shouldn't commit the .Plo files - they're generated during the
build.

It's only really meaningful to mark a constructor as "explicit" if it
takes (or has optional parameters such that it can take) a single
argument.  The "explicit" marking means it won't be used to implicitly
convert a value.  So if you had an array class that could be initialised
with a size:

    Array::Array(size_t size);

If you don't mark that as explicit, then the user could pass an integer
where an Array was expected, and the compiler would create a temporary
array and pass it in, which isn't something you want to happen for this
sort of case.
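
As a small illustration of the difference (a sketch only; LooseArray,
StrictArray, loose_length and strict_length are made-up names for this
example, not anything in the patch or in Xapian):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Single-argument constructor without "explicit": also acts as an
// implicit conversion from an integer.
struct LooseArray {
    LooseArray(std::size_t size) : size_(size) {}
    std::size_t size_;
};

// Same constructor marked "explicit": only direct construction works.
struct StrictArray {
    explicit StrictArray(std::size_t size) : size_(size) {}
    std::size_t size_;
};

std::size_t loose_length(const LooseArray& a) { return a.size_; }
std::size_t strict_length(const StrictArray& a) { return a.size_; }

// The compiler's view of the two classes.
static_assert(std::is_convertible<std::size_t, LooseArray>::value,
              "an integer silently converts to LooseArray");
static_assert(!std::is_convertible<std::size_t, StrictArray>::value,
              "an integer does not silently convert to StrictArray");

void demo() {
    assert(loose_length(5) == 5);                 // compiles: temporary LooseArray
    assert(strict_length(StrictArray(5)) == 5);   // must construct explicitly
    // strict_length(5);                          // would not compile
}
```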

And in the final patch some of the comments aren't actually multi-line
but instead are really one long line which looks like a multi-line
comment if viewed wrapped at 80 columns.  If you look at the diff itself
you will probably see what I mean.

Cheers,
    Olly
