[Xapian-devel] Test Dataset for performance and accuracy analysis

Wed Mar 5 22:42:26 GMT 2014

Hi Parth,

I think this solves my problem. One part of my project is to build a
performance test module that measures not only the speed but also the
relevance of the weighting schemes, to determine whether we can use a better
default weighting scheme. Gaurav has already written an evaluation module
in Xapian. So I think understanding how to use it, and then feeding it the
data you've suggested once I understand its structure, will do
the job. I will definitely come back to you if I need more help on the
theory side of judging relevance. Thank you so much for your time. :)
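
As a rough illustration of the relevance side of such a test (this is not
Gaurav's evaluation module, whose interface is not described here), the
sketch below scores one ranked result list against qrels. It assumes
TREC-style four-column qrels lines (topic, iteration, document id,
relevance); the INEX files may be laid out differently. The function names
parse_qrels, precision_recall and average_precision are made up for this
example.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each topic id to the set of document ids judged relevant.

    Assumes TREC-style lines: 'topic iteration docid relevance'.
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docid, rel = parts
        if int(rel) > 0:
            relevant[topic].add(docid)
    return relevant


def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k documents of one ranking."""
    top = ranked[:k]
    hits = sum(1 for docid in top if docid in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


if __name__ == "__main__":
    # Made-up judgements and ranking, for illustration only.
    qrels = parse_qrels(["1 0 d1 1", "1 0 d2 0", "1 0 d3 1"])
    ranking = ["d1", "d2", "d3", "d4"]
    print(precision_recall(ranking, qrels["1"], k=2))
    print(average_precision(ranking, qrels["1"]))
```

Averaging average_precision over all topics would give one number per
weighting scheme, which could then be compared between BM25 and the DFR
schemes on the same collection.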

-Regards
-Aarsh

On Wed, Mar 5, 2014 at 4:43 PM, Parth Gupta <pargup8 at gmail.com> wrote:

> Hi Aarsh,
>
> Yes, it's very important to test the implemented algorithms on
> benchmark collections. Most of the evaluation forums (TREC, CLEF, INEX,
> FIRE, NTCIR) release corresponding datasets. The most suitable one for you
> would be an ad-hoc collection, which comprises a document collection,
> topics (query set) and qrels (relevance judgements).
>
> As these evaluation forums put a lot of effort (and money) into preparing
> them, they are not easily and freely available. Such datasets are mostly
> free for research use if you are registered with them or you participate in
> their tracks.
>
> I see that the INEX ad-hoc collection for 2009 and 2010 is available on
> registering, so you can register with them, log in and download the dataset
> along with the queries and qrels. The link is:
>
> https://inex.mmci.uni-saarland.de/
>
> Use the ad-hoc collection; it was also used for testing the Letor
> implementation and BM25 in 2011 during GSoC (
> http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme
> ).
>
> Cheers,
> Parth.
>
>
> On Tue, Mar 4, 2014 at 4:46 PM, Aarsh Shah <aarshkshah1992 at gmail.com> wrote:
>
>> Hi Parth,
>>
>> I implemented the DFR algorithms in Xapian
>> as a part of GSoC last year under the mentorship of Olly. This year, I want
>> to work on analyzing and optimizing the performance of the DFR algorithms
>> and comparing them with BM25. I also want to work on profiling the query
>> expansion schemes and testing the relevance (precision and recall) and
>> speed (time taken) of the algorithms.
>>
>> However, for this, I need a well-defined
>> data set containing a considerable amount of textual data, query logs
>> containing queries that can be run on it, and a set of relevant or expected
>> documents which can be compared with the actual results to measure the
>> relevance of the schemes. Could you please help me with this? Thank you so
>> much for your time.
>>
>> -Regards
>> -Aarsh
>>
>
>