Introduction and Doubts
James Aylett
james-xapian at tartarus.org
Fri Mar 11 12:18:51 GMT 2016
On Fri, Mar 11, 2016 at 01:21:14AM +0530, nirmal singhania wrote:
> Tf-idf is the most used weighting scheme, is easy to understand, and has
> been used in other frameworks like Lucene and many other places.
> Okapi BM25 (implemented in Xapian) is theoretically a better/improved
> measure than tf-idf.
Okay, so doesn't that suggest using BM25 instead of tf-idf? Or even
making it configurable, since Xapian already has an abstraction for
weighting schemes, so the user can plug in whatever they want (with a
sensible default)?
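
To make the "configurable, with a sensible default" idea concrete, here is a
minimal sketch in plain Python. It is not Xapian's API: the function names
(tfidf_weight, bm25_weight, score) and the BM25 parameters k1=1.2, b=0.75 are
assumptions chosen for illustration, and the idf in bm25_weight is one common
smoothed variant rather than necessarily the exact form Xapian uses.

```python
import math
from collections import Counter

def tfidf_weight(tf, df, n_docs, doc_len, avg_len):
    """Plain tf-idf: raw term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 with a smoothed idf; k1 and b are assumed defaults."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Term-frequency saturation with document-length normalisation.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

def score(query, docs, weight=bm25_weight):
    """Score each tokenised document against the query.

    `weight` is the pluggable scheme; BM25 is the default.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append(sum(weight(tf[t], df[t], n_docs, len(d), avg_len)
                          for t in query if tf[t] > 0))
    return scores

docs = [["xapian", "search", "engine"],
        ["cluster", "search"],
        ["cluster", "cluster", "documents"]]
bm25_scores = score(["cluster"], docs)                        # default scheme
tfidf_scores = score(["cluster"], docs, weight=tfidf_weight)  # swapped in
```

Because both schemes share one signature, switching between them is a
one-argument change, which is the property the weighting abstraction is
meant to give you.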
> I am looking into various other weighting schemes which are in
> Xapian or can be implemented, like TF-ICF (term frequency-inverse
> corpus frequency) and TF-RF (term frequency-relevance frequency).
If there's a useful weighting scheme to add for clustering that Xapian
doesn't support, that could be a useful 'warmup' piece of work, before
the main project starts, to help you get used to developing Xapian.
> For evaluating the speed and accuracy of the final clustering system we
> can benchmark it against various other algorithms like k-means and HAC,
> based on the measures mentioned in the previous mail (purity, F-measure,
> entropy, overall similarity, relative margin, variance ratio).
Great. Sounds like you have lots of helpful detail for your proposal
on this :-)
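
As one illustration of the measures listed in the quoted text, purity can be
sketched in a few lines of Python. This uses the standard definition (each
cluster is credited with its most common true class); the function name and
the toy labels are made up for the example and do not come from any real
benchmark.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of documents that belong to their cluster's majority class."""
    if not true_labels or len(cluster_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    # Sum the size of the majority class in every cluster.
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)

# Toy data: two clusters of three documents, each with one misplaced document.
example = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
```

Purity on its own rewards producing many small clusters, which is one reason
to report it alongside the other measures rather than in isolation.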
J
--
James Aylett, occasional trouble-maker
xapian.org