Introduction and Doubts

James Aylett james-xapian at tartarus.org
Fri Mar 11 12:18:51 GMT 2016


On Fri, Mar 11, 2016 at 01:21:14AM +0530, nirmal singhania wrote:

> Tf-idf is the most widely used weighting scheme, is easy to understand, and
> has been used in other frameworks like Lucene and many other places.
> Okapi BM25 (implemented in Xapian) is theoretically a better/improved measure
> than tf-idf.

Okay, so doesn't that suggest using BM25 instead of tf-idf? Or even
making it configurable, since Xapian already has an abstraction for
weighting schemes, so the user can plug in whatever they want (with a
sensible default)?
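
As a rough illustration of what "configurable with a sensible default"
could look like, here is a minimal Python sketch. It is not Xapian's API:
the TfIdf and BM25 classes, the weight() signature and term_vectors() are
hypothetical stand-ins, the BM25 idf uses one common "+1" variant, and the
k1/b values are only illustrative.

```python
import math
from collections import Counter


class TfIdf:
    """Classic tf-idf: raw term frequency times log inverse document frequency."""

    def weight(self, tf, df, n_docs, doc_len, avg_len):
        return tf * math.log(n_docs / df)


class BM25:
    """Okapi BM25 (hypothetical stand-in; k1 and b are illustrative values)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b

    def weight(self, tf, df, n_docs, doc_len, avg_len):
        # "+1" inside the log keeps the idf positive even for very common terms.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalisation: longer-than-average documents are damped.
        norm = self.k1 * (1 - self.b + self.b * doc_len / avg_len)
        return idf * tf * (self.k1 + 1) / (tf + norm)


def term_vectors(docs, scheme=None):
    """Build one {term: weight} vector per document.

    The scheme is pluggable; BM25 is used when the caller passes nothing.
    """
    scheme = scheme or BM25()
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised) / n_docs
    # Document frequency: number of documents each term appears in.
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({
            term: scheme.weight(freq, df[term], n_docs, len(toks), avg_len)
            for term, freq in tf.items()
        })
    return vectors


docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
bm25_vectors = term_vectors(docs)            # default scheme
tfidf_vectors = term_vectors(docs, TfIdf())  # user-supplied scheme
```

The clustering code only ever calls weight(), so swapping schemes does not
touch it.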

> I am looking into various other weighting schemes which are in
> Xapian or can be implemented, like TF-ICF (term frequency-inverse
> corpus frequency) and TF-RF (term frequency-relevance frequency).

If there's a useful weighting scheme to add for clustering that Xapian
doesn't support, that could be a useful 'warmup' piece of work, before
the main project starts, to help you get used to developing Xapian.

> For evaluating the speed and accuracy of the final clustering system we
> can benchmark it against various other algorithms like k-means and HAC,
> based on the measures mentioned in the previous
> mail (purity, F-measure, entropy, overall similarity, relative
> margin, variance ratio).

Great. Sounds like you have lots of helpful detail for your proposal
on this :-)
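
For reference, two of the measures listed in the quoted message (purity and
entropy) are simple to compute once you have gold labels for a test
collection. A minimal sketch, assuming each document has exactly one cluster
id and one gold class; the function names and input format are hypothetical
and not taken from any existing code:

```python
import math
from collections import Counter


def _class_counts(clusters, labels):
    """Group gold labels by cluster id: {cluster: Counter({label: count})}."""
    by_cluster = {}
    for cluster, label in zip(clusters, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of documents that belong to their cluster's majority class."""
    by_cluster = _class_counts(clusters, labels)
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster class entropy (base 2)."""
    total = len(labels)
    result = 0.0
    for counts in _class_counts(clusters, labels).values():
        size = sum(counts.values())
        h = -sum((n / size) * math.log2(n / size) for n in counts.values())
        result += (size / total) * h
    return result


clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, labels)
e = entropy(clusters, labels)
```

Higher purity and lower entropy both indicate clusters that line up better
with the gold classes, so the two should move in opposite directions when
comparing algorithms.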

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org


