[Xapian-devel] Complete GSOC idea

Aarsh Shah aarshkshah1992 at gmail.com
Tue Mar 4 15:28:57 GMT 2014


Hi Olly,

I am asking Parth if he can help me with a dataset containing query logs
and expected results. Also, is the evaluation module fully functional? I
saw that some issues are still open on it. I initially thought I would
write the query log and expected result set by hand for some Wikipedia
articles, but I realize now that you have a point, as we need to test on
a large number of articles.

-Regards
-Aarsh


On Tue, Mar 4, 2014 at 5:26 PM, Olly Betts <olly at survex.com> wrote:

> On Sat, Mar 01, 2014 at 10:12:36AM +0530, Aarsh Shah wrote:
> > I am thinking of working on the following ideas for my GSOC proposal
> > based on my discussions with Olly and my own understanding. Rather
> > than focusing on an entire perftest module, I have decided to focus on
> > implementing performance tests for  weighting schemes based on a
> > wikipedia dump and in addition to that, build a framework to measure
> > the accuracy and relevance of new and old weighting schemes.
>
> I mentioned this on IRC (not sure if it was before or after you sent
> this mail), but for the benefit of anyone reading who wasn't on IRC
> then, we do already have an evaluation module which was originally
> written by Andy MacFarlane, and further worked on by Gaurav Arora:
>
> https://github.com/samuelharden/xapian-evaluation
>
> > *Measuring the relevance and accuracy of weighting schemes*
> >
> >    - The accuracy of a weighting scheme can be measured using the
> >    concepts of precision and recall:
> >    http://en.wikipedia.org/wiki/Precision_and_recall
> >    - Once we have the static Wikipedia dump in place, we can hardcode
> >    expected results for each query we plan to run on the data set.
>
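[For readers unfamiliar with the metrics mentioned above, precision and
recall over a single query can be computed like this -- a minimal sketch
with made-up result and judgement sets, not code from the evaluation
module:]

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: document ids returned by the search engine.
    relevant: ids judged relevant for the query.
    Both lists here are hypothetical illustrative data.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 results returned, 2 of them among the 3 judged relevant,
# giving precision 0.5 and recall 2/3.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 7])
print(p, r)
```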
> How would you get a list of suitable queries to run against a wikipedia
> dump?  I've not seen public query logs for wikipedia.
>
> How would you get the "expected results for each query"?  Producing a
> set of relevance judgements is rather time consuming.  If the relevance
> judgements are poor quality, the conclusions of the evaluation become
> untrustworthy.
>
> I suspect it would be better to use an existing dataset which included
> queries and relevance judgements - Parth might know if there's one we
> could use.
>
> >         *Profiling and Optimizing Weighting/Query Expansion Schemes*
> >
> >    - Profile DFR schemes and identify/optimize bottlenecks.
> >    - Profile stemming algorithms and indexing.
> >    - For profiling most searches, which are fast, valgrind-based
> >    profilers can be used. However, perf can be brought in for slower
> >    searches, as we had discussed that valgrind-based profilers may not
> >    be efficient for IO-bound tasks.
> >    - The speed will first be tested using the RealTime::now() function,
> >    and the profiler will be brought in if the speed appears to be too
> >    slow.
> >    - As mentioned on the ideas page too, a lot of the optimization
> >    can/will happen by mapping the formulas used to a smaller set of
> >    formulas and reducing the number of times computationally heavy
> >    operations such as log() are used.
> >    - Create a huge static dataset, preferably a Wikipedia dump.
> >    - Test the speed of the DFR schemes against the speed of BM25 and
> >    decide on a default weighting scheme. Our best bet would be the
> >    parameter-free DPH scheme, as the performance of schemes with
> >    parameters depends on the input data too.
> >    - Similarly, a speed analysis of query expansion schemes will also
> >    be done to decide on a default query expansion scheme. These can be
> >    optimized too.
> >
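[The log()-reduction point above can be illustrated with a toy example:
many weighting formulas evaluate a quantity like log(N/df) for every
matching document, but that value depends only on the term, so it can be
hoisted out of the per-document loop. This is a sketch of the general
idea, not Xapian's actual weighting code:]

```python
import math

def score_naive(term_df, doc_tfs, N):
    # log() recomputed for every document -- the pattern to avoid.
    return [tf * math.log(N / term_df) for tf in doc_tfs]

def score_hoisted(term_df, doc_tfs, N):
    # The log depends only on the term, so compute it once per query term.
    idf = math.log(N / term_df)
    return [tf * idf for tf in doc_tfs]

# Both produce identical scores; the second calls log() once instead of
# once per document.
assert score_naive(10, [1, 2, 3], 1000) == score_hoisted(10, [1, 2, 3], 1000)
```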
> >         I am not quite able to decide on an ideal patch for the idea.
> > Can you please suggest some ideas for an ideal patch as an initial
> > first step to include with my proposal?
>
> I'd suggest trying out profiling something, to get a feel for how the
> profiling tools work, and for how long the process of finding a
> bottleneck and fixing it takes.
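[As a starting point for the time-first, profile-second approach
discussed above, a minimal wall-clock harness might look like the sketch
below. Xapian's RealTime::now() is a C++ helper; this Python version,
with a hypothetical search_fn stand-in, just shows the shape of the
measurement:]

```python
import time

def time_search(search_fn, repeats=5):
    """Time a callable several times and return the best wall-clock run.

    Taking the minimum reduces noise from other processes; if even the
    best run is slow, that is the signal to reach for a profiler.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-in for running a query against a database.
elapsed = time_search(lambda: sum(i * i for i in range(10000)))
print(f"best of 5 runs: {elapsed:.6f}s")
```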
>
> Cheers,
>     Olly
>