[Xapian-devel] Complete GSOC idea

Olly Betts olly at survex.com
Tue Mar 4 11:56:14 GMT 2014


On Sat, Mar 01, 2014 at 10:12:36AM +0530, Aarsh Shah wrote:
> I am thinking of working on the following ideas for my GSOC proposal,
> based on my discussions with Olly and my own understanding. Rather
> than focusing on an entire perftest module, I have decided to focus on
> implementing performance tests for weighting schemes based on a
> Wikipedia dump, and in addition to that, build a framework to measure
> the accuracy and relevance of new and old weighting schemes.

I mentioned this on IRC (not sure if it was before or after you sent
this mail), but for the benefit of anyone reading who wasn't on IRC
then, we do already have an evaluation module which was originally
written by Andy MacFarlane, and further worked on by Gaurav Arora:

https://github.com/samuelharden/xapian-evaluation

> *Measuring the relevance and accuracy of weighting schemes*
> 
>    - The accuracy of a weighting scheme can be measured using the
>    concepts of precision and recall:
>    http://en.wikipedia.org/wiki/Precision_and_recall
>    - Once we have the static Wikipedia dump in place, we can hardcode
>    expected results for each query we plan to run on the data set.

How would you get a list of suitable queries to run against a Wikipedia
dump?  I've not seen public query logs for Wikipedia.

How would you get the "expected results for each query"?  Producing a
set of relevance judgements is rather time-consuming.  If the relevance
judgements are of poor quality, the conclusions of the evaluation
become untrustworthy.

I suspect it would be better to use an existing dataset which included
queries and relevance judgements - Parth might know if there's one we
could use.
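
For reference, once you do have trustworthy judgements, the core
precision/recall calculation is trivial - producing the judgements is
the expensive part.  A minimal sketch (assuming the judgements for a
query are held as a set of relevant docids):

    #include <set>
    #include <vector>

    // precision = relevant results retrieved / total results retrieved
    // recall    = relevant results retrieved / total relevant documents
    void
    precision_recall(const std::vector<unsigned>& results,
                     const std::set<unsigned>& relevant,
                     double& precision, double& recall)
    {
        unsigned hits = 0;
        for (std::vector<unsigned>::const_iterator i = results.begin();
             i != results.end(); ++i) {
            if (relevant.find(*i) != relevant.end()) ++hits;
        }
        precision = results.empty() ? 0.0
                                    : double(hits) / double(results.size());
        recall = relevant.empty() ? 0.0
                                  : double(hits) / double(relevant.size());
    }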

> *Profiling and Optimizing Weighting/Query Expansion Schemes*
> 
>    - Profile DFR schemes and identify/optimize bottlenecks.
>    - Profile stemming algorithms and indexing.
>    - For profiling most searches, which are fast, valgrind-based
>    profilers can be used. However, perf can be brought in for slower
>    searches, as we had discussed that valgrind-based profilers may not
>    be efficient for I/O-bound tasks.
>    - The speed will first be tested using the RealTime::now() function,
>    and then the profiler will be brought in if the speed appears to be
>    too slow.
>    - As mentioned on the ideas page too, a lot of the optimization
>    can/will happen by mapping the formulas used onto a smaller set of
>    formulas, and by reducing the number of times computationally heavy
>    operations such as log() are used.
>    - Create a huge static data-set, preferably a Wikipedia dump.
>    - Test the speed of the DFR schemes against the speed of BM25 and
>    decide on a default weighting scheme. Our best bet would be the
>    parameter-free DPH scheme, as the performance of the ones with
>    parameters depends on the input data too.
>    - Similarly, a speed analysis of the query expansion schemes will
>    also be done to decide on a default query expansion scheme. These
>    can be optimized too.
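
On reducing the number of log() calls mentioned above: the usual trick
is to hoist any factor which doesn't depend on the within-document
statistics out of the per-document calculation, so it gets computed once
per term rather than once per matching document.  Schematically (a
made-up formula, not any actual DFR scheme, and a toy class rather than
the real Xapian::Weight interface):

    #include <cmath>

    // Toy weight: w(wdf) = k * log(N / termfreq) * wdf / (wdf + 1).
    // log(N / termfreq) is the same for every document matching the
    // term, so cache it in the constructor instead of recomputing it
    // in the per-document method.
    class ToyWeight {
        double term_factor;

      public:
        ToyWeight(double k, double collection_size, double termfreq)
            : term_factor(k * std::log(collection_size / termfreq)) { }

        // Called once per matching document - no log() needed here.
        double get_sumpart(double wdf) const {
            return term_factor * wdf / (wdf + 1.0);
        }
    };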
> 
> I am not quite able to decide on an ideal patch for this idea.
> Please could you suggest some ideas for an ideal patch as an initial
> first step to include with my proposal?

I'd suggest trying out profiling something, to get a feel for how the
profiling tools work, and for how long the process of finding a
bottleneck and fixing it takes.
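
For example, something like this (just a sketch - the database path and
query are placeholders, and it assumes a xapian-core build recent enough
to have the DFR schemes) is enough to get a first wall-clock number for
BM25 versus DPH before reaching for callgrind or perf.  Note that
RealTime::now() is internal to xapian-core, so out-of-tree timing code
needs its own clock:

    #include <xapian.h>
    #include <chrono>
    #include <iostream>

    int main() {
        Xapian::Database db("/path/to/wikipedia/db");
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("example"));
        enquire.set_weighting_scheme(Xapian::BM25Weight());
        // ... or: enquire.set_weighting_scheme(Xapian::DPHWeight());

        std::chrono::steady_clock::time_point start =
            std::chrono::steady_clock::now();
        Xapian::MSet mset = enquire.get_mset(0, 10);
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;

        std::cout << mset.size() << " results in "
                  << elapsed.count() << "s" << std::endl;
    }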

Cheers,
    Olly


