[Xapian-devel] Complete GSOC idea

Aarsh Shah aarshkshah1992 at gmail.com
Sat Mar 1 04:42:36 GMT 2014


Hi everyone,

I am thinking of working on the following ideas for my GSoC proposal, based
on my discussions with Olly and my own understanding. Rather than focusing
on an entire perftest module, I have decided to focus on implementing
performance tests for weighting schemes based on a Wikipedia dump and, in
addition to that, building a framework to measure the accuracy and relevance
of new and old weighting schemes.

*Measuring the relevance and accuracy of weighting schemes*

   - The accuracy of a weighting scheme can be measured using the
   concepts of precision and recall:
   http://en.wikipedia.org/wiki/Precision_and_recall
   - Once we have the static Wikipedia dump in place, we can hardcode the
   expected results for each query we plan to run on the data set. By
   comparing the expected results with the retrieved results over a number
   of queries for each weighting scheme, we can get a general idea of its
   accuracy.
   - This implementation will also help determine the accuracy of new
   weighting schemes as and when they are implemented in Xapian.
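
As a rough illustration of the comparison described above, here is a minimal
sketch in Python. It is not tied to the Xapian API: the function names, the
query strings and the document ids are all hypothetical, and the retrieved
lists are assumed to have been collected beforehand by running each query
under one weighting scheme.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the search (hypothetical input)
    relevant:  hardcoded document ids we expect for that query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so we never divide by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def mean_precision_recall(results, expected):
    """Average precision and recall over every query in `expected`."""
    if not expected:
        return 0.0, 0.0
    pairs = [
        precision_recall(results.get(query, []), docs)
        for query, docs in expected.items()
    ]
    count = len(pairs)
    mean_precision = sum(p for p, _ in pairs) / count
    mean_recall = sum(r for _, r in pairs) / count
    return mean_precision, mean_recall


# Hypothetical hardcoded judgments and results for one weighting scheme.
expected = {"query a": [1, 2, 3, 4], "query b": [10, 11]}
retrieved = {"query a": [1, 2, 9], "query b": [10, 11, 12, 13]}

mean_p, mean_r = mean_precision_recall(retrieved, expected)
print("mean precision %.3f, mean recall %.3f" % (mean_p, mean_r))
```

Running the same hardcoded judgments against the results of each weighting
scheme would give one pair of numbers per scheme, which can then be compared
side by side.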

*Profiling and Optimizing Weighting/Query Expansion Schemes*

   - Profile the DFR schemes and identify/optimize bottlenecks.
   - Profile the stemming algorithms and indexing.
   - For profiling most searches, which are fast, valgrind-based profilers
   can be used. However, perf can be brought in for slower searches, as we
   had discussed that valgrind-based profilers may not be efficient for
   IO-bound tasks.
   - The speed will first be measured using the RealTime::now() function,
   and the profiler will be brought in only if the speed appears to be too
   slow.
   - As mentioned on the ideas page too, a lot of the optimization can/will
   happen by mapping the formulas used to a smaller set of formulas and
   reducing the number of times computationally heavy operations such as
   log() are used.
   - Create a huge static data set, preferably a Wikipedia dump.
   - Test the speed of the DFR schemes against the speed of BM25 and decide
   on a default weighting scheme. Our best bet would be a parameter-free DPH
   scheme, as the performance of the schemes with parameters also depends
   on the input data.
   - Similarly, a speed analysis of the query expansion schemes will be
   done to decide on a default query expansion scheme. These can be
   optimized too.
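
To make the point about reducing log() calls concrete, here is a small
Python sketch. The formula is a toy one chosen only for illustration; it is
not one of Xapian's DFR formulas, and every name in it is hypothetical. The
idea is simply that a logarithm which depends only on collection statistics
can be computed once per term instead of once per document.

```python
import math


def score_naive(wdf, doc_len, term_freq, num_docs, avg_len):
    """Toy weight that recomputes the logarithm for every document."""
    return math.log2(num_docs / term_freq) * wdf / (wdf + doc_len / avg_len)


class HoistedScorer:
    """Same toy weight, with the log() evaluated once per term."""

    def __init__(self, term_freq, num_docs, avg_len):
        # Depends only on collection statistics, so compute it up front.
        self.log_factor = math.log2(num_docs / term_freq)
        self.avg_len = avg_len

    def score(self, wdf, doc_len):
        # No log() call on the per-document path.
        return self.log_factor * wdf / (wdf + doc_len / self.avg_len)


# Hypothetical statistics: (wdf, document length) for three documents.
docs = [(3, 120), (1, 80), (7, 300)]
scorer = HoistedScorer(term_freq=50, num_docs=10000, avg_len=150.0)

naive = [score_naive(wdf, length, 50, 10000, 150.0) for wdf, length in docs]
hoisted = [scorer.score(wdf, length) for wdf, length in docs]
print(naive == hoisted)
```

Both versions give the same weights; the second just makes one log() call
per term rather than one per matching document. Whether this helps in
practice is something the profiling would have to confirm.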

I am not quite able to decide on an ideal patch for this idea. Could you
please suggest some ideas for a patch to include with my proposal as an
initial first step?

-Regards
-Aarsh