[Xapian-devel] Complete GSOC idea
Aarsh Shah
aarshkshah1992 at gmail.com
Tue Mar 4 15:28:57 GMT 2014
Hi Olly,

I am asking Parth if he can help me with a dataset containing query logs
and expected results. Also, is the evaluation module fully functional? I
saw that some issues on it are still open. I initially thought I would
write the query log and expected result set by hand for some Wikipedia
articles, but I realize now that you have a point, as we need to test on
a large number of articles.

Regards,
Aarsh
On Tue, Mar 4, 2014 at 5:26 PM, Olly Betts <olly at survex.com> wrote:
> On Sat, Mar 01, 2014 at 10:12:36AM +0530, Aarsh Shah wrote:
> > I am thinking of working on the following ideas for my GSOC proposal
> > based on my discussions with Olly and my own understanding. Rather
> > than focusing on an entire perftest module, I have decided to focus on
> > implementing performance tests for weighting schemes based on a
> > wikipedia dump and in addition to that, build a framework to measure
> > the accuracy and relevance of new and old weighting schemes.
>
> I mentioned this on IRC (not sure if it was before or after you sent
> this mail), but for the benefit of anyone reading who wasn't on IRC
> then, we do already have an evaluation module which was originally
> written by Andy MacFarlane, and further worked on by Gaurav Arora:
>
> https://github.com/samuelharden/xapian-evaluation
>
> > *Measuring the relevance and accuracy of weighting schemes*
> >
> > - The accuracy of a weighting scheme can be measured using the
> > concepts of precision and recall:
> > http://en.wikipedia.org/wiki/Precision_and_recall
> > - Once we have the static Wikipedia dump in place, we can hardcode
> > expected results for each query we plan to run on the data set.
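To make the precision/recall idea above concrete, here is a minimal sketch of how both would be computed for a single query, assuming per-query relevance judgements are available (illustrative Python only; the helper name is made up, and xapian-evaluation itself is C++):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved -- ranked list of doc ids returned by the weighting scheme
    relevant  -- set of doc ids judged relevant for the query
    """
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example judgements: docs 1, 3 and 7 are relevant; the scheme returned
# docs 1, 2, 3 and 4, so 2 of the 4 retrieved are relevant (P = 0.5)
# and 2 of the 3 relevant docs were found (R = 2/3).
p, r = precision_recall([1, 2, 3, 4], {1, 3, 7})
```

The quality of these numbers depends entirely on the judgements, which is exactly the concern about hand-written "expected results" raised below.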
>
> How would you get a list of suitable queries to run against a wikipedia
> dump? I've not seen public query logs for wikipedia.
>
> How would you get the "expected results for each query"? Producing a
> set of relevance judgements is rather time consuming. If the relevance
> judgements are poor quality, the conclusions of the evaluation become
> untrustworthy.
>
> I suspect it would be better to use an existing dataset which included
> queries and relevance judgements - Parth might know if there's one we
> could use.
>
> > *Profiling and Optimizing Weighting/Query Expansion Schemes*
> >
> > - Profile DFR schemes and identify/optimize bottlenecks.
> > - Profile stemming algorithms and indexing.
> > - For profiling most searches, which are fast, valgrind-based
> > profilers can be used. However, perf can be brought in for slower
> > searches, as we had discussed that valgrind-based profilers may not
> > be efficient for IO-bound tasks.
> > - The speed will first be tested using the RealTime::now() function,
> > and then the profiler will be brought in if the speed appears to be
> > too slow.
> > - As mentioned on the ideas page too, a lot of the optimization
> > can/will happen by mapping the formulas used to a smaller set of
> > formulas and reducing the number of times computationally heavy
> > operations such as log() are used.
> > - Create a huge static data set, preferably a Wikipedia dump.
> > - Test the speed of the DFR schemes against the speed of BM25 and
> > decide on a default weighting scheme. Our best bet would be a
> > parameter-free scheme such as DPH, as the performance of schemes
> > with parameters depends on the input data too.
> > - Similarly, a speed analysis of the query expansion schemes will
> > also be done to decide on a default query expansion scheme. These
> > can be optimized too.
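The log() reduction mentioned in the list above can be sketched as follows. This is an illustrative Python toy, not Xapian's actual C++ code; the weighting function here is invented purely to show the caching idea. Many DFR formulas apply log() to small integer within-document frequencies (wdf), and the same wdf values recur constantly across postings, so the distinct results can be computed once and reused:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_log(n):
    # Memoized: log(n) is computed once per distinct n.
    return math.log(n)

def weight_naive(wdfs):
    # Toy weighting: recomputes log(wdf + 1) for every posting.
    return sum(math.log(w + 1) for w in wdfs)

def weight_cached(wdfs):
    # Same result, but each distinct wdf costs only one log() call.
    return sum(cached_log(w + 1) for w in wdfs)

# 6000 postings but only 3 distinct wdf values, so only 3 log() calls.
wdfs = [1, 2, 1, 3, 2, 1] * 1000
assert abs(weight_naive(wdfs) - weight_cached(wdfs)) < 1e-9
```

In Xapian's C++ the same effect would more likely come from algebraically rewriting the formulas and precomputing per-term constants outside the per-document loop, but the principle is the same: pay for log() once per distinct input, not once per posting.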
> >
> > I am not quite able to decide on an ideal patch for the idea. Could
> > you please suggest some ideas for an ideal patch as an initial first
> > step to include with my proposal?
>
> I'd suggest trying out profiling something, to get a feel for how the
> profiling tools work, and for how long the process of finding a
> bottleneck and fixing it takes.
>
> Cheers,
> Olly
>