[Xapian-devel] Complete GSOC idea
Aarsh Shah
aarshkshah1992 at gmail.com
Sat Mar 1 04:42:36 GMT 2014
Hi everyone,
I am thinking of working on the
following ideas for my GSOC proposal based on my discussions with Olly and
my own understanding. Rather than focusing on an entire perftest module, I
have decided to focus on implementing performance tests for weighting
schemes based on a wikipedia dump and in addition to that, build a
framework to measure the accuracy and relevance of new and old weighting
schemes.
* Measuring the relevance and accuracy of weighting schemes.*
- The accuracy of a weighting scheme can be measured by using the
concepts of precision and recall. :-
http://en.wikipedia.org/wiki/Precision_and_recall
- Once we have the static wikipedia dump in place, we can hardcode
expected results for each query we plan to run on the data set. By
comparing the expected results to the retrieved results for a number of
queries for each weighting scheme, we can get a general idea of it's
accuracy.
- This implementation will also help determine the accuracy of new
weighting schemes as and when they will be implemented in Xapian.
*Profiling and Optimizing Weighting/Query Expansion Schemes*
- Profile DFR schemes and identify/optimize bottlenecks.
- Profile Stemming algorithms and indexing .
- For profiling most searches which are fast, valgrind based profilers
can be used.However, perf can be brought in for slower searches as we had
discussed that valgrind based profilers may not be efficient for IO bound
tasks.
- The speed will first be tested using the Realtime:now function and
then the profiler will be brought in if the speed appears to be too slow.
- As mentioned on the ideas page too, a lot of the optimization can/will
happen by mapping the forumals used to a smaller set of formulas and reduce
the number of times computationally heavy operations such as log() are used.
- Create a huge static data-set, preferably a Wikipedia dump.
- Test the speed of the DFR schemes against the speed of BM25 and decide
on a default weighting scheme. Our best bet would be a parameter free DPH
schemes as the performance of the one with parameters depends on the input
data too.
- Similarly, a speed analysis of query expansion scheme will also be
done to decide on a default query expansion scheme.These can be optimized
too.
I am not quite being able to decide on an ideal patch for the idea
.Please can you suggest some ideas for an ideal patch as an initial first
step to include with my proposal ?
-Regards
-Aarsh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140301/3470f83c/attachment.html>
More information about the Xapian-devel
mailing list