Weighting Schemes: Evaluation results
James Aylett
james-xapian at tartarus.org
Mon Jul 25 16:30:35 BST 2016
On Mon, Jul 25, 2016 at 06:11:21PM +0530, Vivek Pal wrote:
> > We probably don't want them committed in git, since they're evaluation
> > runs (we can recreate them); a gist might be more appropriate.
>
> Sorry, I have moved the results files for each individual weighting
> scheme over to a gist.
> Link: https://gist.github.com/ivmarkp/secret
You need to share the actual URL of the gist; otherwise, I think only
you can see them :-)
Or just make them public; there's nothing sensitive in these, I think.
(One gist can contain multiple files, and people can then clone or
download the whole lot easily.)
> > I can't tell, but are some of those files from FIRE?
>
> No, those files are generated each time a run is completed, and just
> contain the evaluation results that are displayed on the terminal.
Okay, great.
> > Can you remind me what sort of corpus you're using from FIRE for this?
>
> The corpus we are using contains news articles/stories, sorted by
> section and time period, from two different news providers: BDNews24
> and The Telegraph.
Great, thanks; it's worth noting this somewhere (maybe on your project
wiki page).
> > Do you have any idea what 'very long' means in this case, in terms of
> > number of terms (or maybe a multiple of the mean number of terms)?
>
> Very long in terms of the number of terms, as specified in the paper;
> in general, documents where the length |D| is much larger than avdl
> (the average document length).
>
> It is mentioned in the paper that "the MAP improvements of BM25+ over BM25
> are much larger on Web collections than on the news collection. In
> particular, the MAP improvements on all Web collections are statistically
> significant." They seem to have used four TREC collections for this: WT2G,
> WT10G, Terabyte, and Robust04, which represent different sizes and genres
> of text collection.
Ah. If FIRE doesn't have something that can show this suitably, then
maybe Parth can advise on access to TREC, as I know he's used some of
them in the past.
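(For reference, since "very long" keeps coming up: the change BM25+
makes is a small one. Lv and Zhai add a constant delta to the
term-frequency normalisation component, so each matching term
contributes roughly

    idf(t) * ((k1 + 1) * tf / (k1 * (1 - b + b * |D| / avdl) + tf) + delta)

instead of the same expression without the "+ delta". I'm sketching
their formulation from memory rather than quoting the paper, but the
upshot is that a document which contains a query term never scores
below delta * idf(t) for that term, however much larger |D| gets than
avdl - which is exactly the over-penalisation of very long documents
they're correcting.)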
Certainly we shouldn't change the default until we have an evaluation
that shows an improvement. It does sound like it should be possible to
find a suitable dataset to demonstrate this on, though.
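In the meantime it's easy to try BM25+ per query without touching the
default. Here's a minimal sketch, assuming the implementation ends up
exposed as Xapian::BM25PlusWeight (the name used in this work - adjust
if it lands differently); the rest is the ordinary search API:

    #include <xapian.h>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 3) {
            std::cerr << "usage: " << argv[0] << " DB_PATH QUERY\n";
            return 1;
        }
        Xapian::Database db(argv[1]);

        Xapian::QueryParser qp;
        qp.set_database(db);
        Xapian::Query query = qp.parse_query(argv[2]);

        Xapian::Enquire enquire(db);
        enquire.set_query(query);
        // Opt in to BM25+ for this Enquire only; default-constructed,
        // so k1, b etc. keep their usual values (the paper suggests
        // delta = 1.0 as a sensible default).
        enquire.set_weighting_scheme(Xapian::BM25PlusWeight());

        Xapian::MSet matches = enquire.get_mset(0, 10);
        for (Xapian::MSetIterator i = matches.begin();
             i != matches.end(); ++i) {
            std::cout << i.get_rank() + 1 << ": " << i.get_weight()
                      << "\t" << i.get_document().get_data() << "\n";
        }
    }

Everyone who doesn't opt in keeps getting BM25, so the question of the
default can be decided separately by the evaluation results.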
J
--
James Aylett, occasional trouble-maker
xapian.org