[Xapian-devel] Ideas for allowing specification of weighing scheme for Eset

aarsh shah aarshkshah1992 at gmail.com
Thu Feb 7 20:07:21 GMT 2013


Hi guys :) I am working on a hack that will let the user specify a
weighting scheme (along with its parameters, if they do not want the
default values) for building the Eset, rather than using the hard-coded
TradWeight scheme with the default k=1. Olly had suggested that we can
probably get better terms (a more relevant Eset) for query expansion if we
use something like BM25 (or let the user plug in a self-coded scheme) for
ranking the terms.

I read through the code for the proxy, internal and iterator classes of
Eset and Mset to get a feel for how those sets work. I then traced
Enquire::get_eset() (I understood it well, apart from how the TermList tree
is built) and Enquire::get_mset() (I didn't understand this one completely;
I got lost in MultiMatch::get_mset()). I also read the code for
Xapian::Weight (both the proxy and internal classes) and for the BM25 and
TradWeight classes.

The hack now seems fairly straightforward: as far as ranking terms to
build an Eset is concerned, the only difference between BM25 and TradWeight
is that ( k1*L + f ) in the denominator is replaced by
( k1*(b*L + (1-b)) + f ). It seems to me that since we are ranking terms
based on documents (rather than the other way round), we do not need
components like q/(k3+q), because terms already present in the query are
not meant to go into the Eset, so the within-query frequency does not
matter. Likewise we do not need 2*k2*nq/(1+L), as the length of the query
plays no part in building the Eset. (Please do correct me if any of the
assumptions I've made so far are wrong.)
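
To make the comparison above concrete, here is a minimal sketch of the two
per-document components. The helper names and default values are only
illustrative (they are not Xapian functions), and any constant numerator
factor is left out so that only the denominators differ. The parameter k in
the first helper plays the role of k1 in the TradWeight expression above.

```cpp
#include <cassert>

// Illustrative helpers only; these are NOT part of Xapian.
// f = within-document frequency of the term
// L = normalised document length (document length / average length)

// TradWeight-style component: f / (k*L + f)
double trad_component(double f, double L, double k = 1.0) {
    assert(f > 0.0 && L > 0.0 && k >= 0.0);
    return f / (k * L + f);
}

// BM25-style component: f / (k1*(b*L + (1-b)) + f)
// b = 1 applies full length normalisation, b = 0 applies none.
double bm25_component(double f, double L, double k1 = 1.0, double b = 0.5) {
    assert(f > 0.0 && L > 0.0 && k1 >= 0.0 && b >= 0.0 && b <= 1.0);
    return f / (k1 * (b * L + (1.0 - b)) + f);
}
```

With b = 1 the BM25-style component reduces to the TradWeight-style one, and
for a document of exactly average length (L = 1) the two agree for any b.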

So, in order to use BM25 for weighting terms for the Eset, we only need to
change how the "multiplier" data member of the ExpandStats class is
calculated; the final weight can then be returned by
ExpandWeight::get_weight() as (multiplier*tw), where tw is the same for
both weighting schemes. Thus, depending on the weighting scheme and the
parameters the user passes to Enquire::get_eset(), multiplier can be
calculated differently. This is fairly simple to implement. However, I have
yet to figure out how to let the user specify a weighting scheme they have
coded themselves for building the Eset. Please help with that.
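
One possible shape for the user-supplied scheme, sketched under stated
assumptions: the ExpandScheme interface, the RelDoc struct and
expand_weight() below are hypothetical names invented for illustration, not
existing Xapian classes, and the multiplier is assumed to be a sum of
per-document components over the relevant documents. tw is passed in
unchanged, since it does not depend on the scheme.

```cpp
#include <cassert>
#include <vector>

// Hypothetical interface (NOT part of the Xapian API): a scheme only decides
// how much one relevant document contributes to the multiplier.
struct ExpandScheme {
    virtual ~ExpandScheme() = default;
    // wdf = within-document frequency, norm_len = doc length / average length
    virtual double doc_component(double wdf, double norm_len) const = 0;
};

// TradWeight-style contribution: wdf / (k*L + wdf)
struct TradExpand : ExpandScheme {
    double k;
    explicit TradExpand(double k_ = 1.0) : k(k_) {}
    double doc_component(double wdf, double norm_len) const override {
        return wdf / (k * norm_len + wdf);
    }
};

// BM25-style contribution: wdf / (k1*(b*L + (1-b)) + wdf)
struct BM25Expand : ExpandScheme {
    double k1, b;
    explicit BM25Expand(double k1_ = 1.0, double b_ = 0.5) : k1(k1_), b(b_) {}
    double doc_component(double wdf, double norm_len) const override {
        return wdf / (k1 * (b * norm_len + (1.0 - b)) + wdf);
    }
};

// Statistics of one candidate term in one relevant document (illustrative).
struct RelDoc {
    double wdf;
    double norm_len;
};

// Accumulate the multiplier over the relevant documents, then scale by the
// scheme-independent term weight tw to get the final expand weight.
double expand_weight(const ExpandScheme& scheme,
                     const std::vector<RelDoc>& rel_docs, double tw) {
    double multiplier = 0.0;
    for (const RelDoc& d : rel_docs) {
        assert(d.wdf > 0.0 && d.norm_len > 0.0);
        multiplier += scheme.doc_component(d.wdf, d.norm_len);
    }
    return multiplier * tw;
}
```

A self-coded scheme would then just be another subclass overriding
doc_component(); how such an object would actually be handed to
Enquire::get_eset() is left open here.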

This is a summary of everything I've read and planned. Please let me know
if I am wrong somewhere or if I can improve any of this. Thank you for the
awesome documentation of the code base, it really helped a lot. :)

Once I'm done with this hack and with writing its documentation and tests,
my next aim is to start working on incorporating DFR schemes into Xapian,
as we do not have them yet and they look very interesting for building both
the Eset and the Mset since they don't require parameters.

-Regards
-Aarsh