<div dir="ltr">Hi everyone,<br><br> I am thinking of working on the following ideas for my GSOC proposal based on my discussions with Olly and my own understanding. Rather than focusing on an entire perftest module, I have decided to focus on implementing performance tests for weighting schemes based on a wikipedia dump and in addition to that, build a framework to measure the accuracy and relevance of new and old weighting schemes.<br>
<br><div><div><b>Measuring the relevance and accuracy of weighting schemes</b></div><div><ul><li>The accuracy of a weighting scheme can be measured using the concepts of precision (the fraction of retrieved results that are relevant) and recall (the fraction of relevant results that are retrieved):<br><a href="http://en.wikipedia.org/wiki/Precision_and_recall">http://en.wikipedia.org/wiki/Precision_and_recall</a><br>
</li><li>Once we have the static Wikipedia dump in place, we can hard-code the expected results for each query we plan to run on the data set. By comparing the expected results with the retrieved results for a number of queries under each weighting scheme, we can get a general idea of its accuracy (a rough sketch of such a check is included after this list).</li>
<li>This implementation will also help determine the accuracy of new weighting schemes as and when they are implemented in Xapian.</li></ul></div>
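<div>To make the hard-coded expectations concrete, here is a rough, untested sketch of what one such accuracy check could look like; the database path, the query and the expected docids are made-up placeholders rather than anything decided:</div>
<pre>
// Minimal sketch: run one query under a given weighting scheme and compare
// the retrieved docids against hard-coded expected docids for that query.
// "wikipedia.db", the query string and the expected docids are placeholders.
#include &lt;xapian.h&gt;
#include &lt;cstdio&gt;

int main() {
    Xapian::Database db("wikipedia.db");
    Xapian::Enquire enquire(db);

    Xapian::QueryParser qp;
    qp.set_database(db);
    enquire.set_query(qp.parse_query("free software"));
    enquire.set_weighting_scheme(Xapian::BM25Weight());

    // Expected results for this query, hard-coded in advance (made-up docids).
    const Xapian::docid expected[] = { 12, 57, 103, 245 };
    const unsigned n_expected = sizeof(expected) / sizeof(expected[0]);

    Xapian::MSet mset = enquire.get_mset(0, 10);
    unsigned hits = 0;
    for (Xapian::MSetIterator it = mset.begin(); it != mset.end(); ++it) {
        for (unsigned j = 0; j != n_expected; ++j) {
            if (*it == expected[j]) { ++hits; break; }
        }
    }

    // Precision: relevant retrieved / retrieved; recall: relevant retrieved / relevant.
    double precision = mset.empty() ? 0.0 : double(hits) / mset.size();
    double recall = double(hits) / n_expected;
    printf("precision = %.3f, recall = %.3f\n", precision, recall);
    return 0;
}
</pre>
<div>Repeating this over a set of queries and averaging (or computing precision at a fixed cut-off) would give the per-scheme accuracy figures mentioned above.</div>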
<div><b>Profiling and Optimizing Weighting/Query Expansion Schemes</b><br>
<ul><li>Profile the DFR schemes and identify/optimize bottlenecks.</li><li>Profile the stemming algorithms and indexing.</li><li>For profiling most searches, which are fast, valgrind-based profilers can be used. However, perf can be brought in for slower searches since, as we discussed, valgrind-based profilers may not be efficient for I/O-bound tasks.</li>
<li>The speed will first be tested using the RealTime::now() function, and the profiler will only be brought in if a search appears to be too slow (a rough timing sketch follows this list).</li><li>As mentioned on the ideas page, a lot of the optimization can/will happen by mapping the formulas used onto a smaller set of formulas and reducing the number of times computationally heavy operations such as log() are evaluated (illustrated after this list).</li>
<li>Create a large static data set, preferably a Wikipedia dump.</li><li>Test the speed of the DFR schemes against the speed of BM25 and decide on a default weighting scheme. Our best bet would be the parameter-free DPH scheme, since the performance of the schemes with parameters also depends on the input data.</li>
<li>Similarly, a speed analysis of the query expansion schemes will also be done to decide on a default query expansion scheme; these can be optimized too.</li></ul>
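<div>For the speed measurements, something along these lines could give a first wall-clock approximation from outside the library (as far as I understand, RealTime::now() is an internal helper, so a plain gettimeofday() timer is used here; the index path and query are again placeholders):</div>
<pre>
// Rough sketch: time a single get_mset() call with a wall-clock timer.
// Inside xapian-core the internal RealTime::now() helper could be used instead.
#include &lt;xapian.h&gt;
#include &lt;sys/time.h&gt;
#include &lt;cstdio&gt;

static double wall_clock_now() {
    struct timeval tv;
    gettimeofday(&amp;tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
    Xapian::Database db("wikipedia.db");   // hypothetical Wikipedia index
    Xapian::Enquire enquire(db);
    enquire.set_query(Xapian::Query("software"));
    enquire.set_weighting_scheme(Xapian::BM25Weight());

    double start = wall_clock_now();
    Xapian::MSet mset = enquire.get_mset(0, 10);
    double elapsed = wall_clock_now() - start;

    printf("%u results in %.6f seconds\n", (unsigned)mset.size(), elapsed);
    return 0;
}
</pre>
<div>If a search measured this way looks too slow, that is the point at which callgrind or perf would be brought in for a proper profile.</div>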
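<div>And as a generic, made-up illustration of the log()-reduction point above (not actual Xapian code): the two functions below compute the same total, but the second hoists the log() call out of the per-document loop, which is the kind of rewrite that mapping the formulas onto a smaller set makes possible:</div>
<pre>
// Illustrative only: score_naive() re-evaluates log(N / df) for every
// document, while score_hoisted() computes it once per term and reuses it.
#include &lt;cmath&gt;

double score_naive(const int *wdfs, unsigned n, double N, double df) {
    double total = 0.0;
    for (unsigned i = 0; i != n; ++i)
        total += wdfs[i] * std::log(N / df);   // log() inside the hot loop
    return total;
}

double score_hoisted(const int *wdfs, unsigned n, double N, double df) {
    const double idf = std::log(N / df);       // computed once, outside the loop
    double total = 0.0;
    for (unsigned i = 0; i != n; ++i)
        total += wdfs[i] * idf;                // only a cheap multiply per document
    return total;
}
</pre>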
<div>I am not quite able to decide on an ideal patch for this idea. Could you please suggest some ideas for a patch that would make a good first step to include with my proposal?<br><br>Regards,<br>Aarsh</div></div></div></div>