<div dir="ltr">TF-IDF is the most widely used weighting scheme; it is easy to understand and has been used in other frameworks such as Lucene and in many other places.<div>Okapi BM25 (implemented in Xapian) is theoretically an improvement over TF-IDF.</div><div> </div><div>I am also looking into other weighting schemes that either exist in Xapian or could be implemented, such as TF-ICF (term frequency-inverse corpus frequency) and TF-RF (term frequency-relevance frequency).</div><div><br></div><div><br></div><div>To evaluate the speed and accuracy of the final clustering system, we can benchmark it against other algorithms such as k-means and HAC, using the measures mentioned in the previous mail (purity, F-measure, entropy, overall similarity, relative margin, variance ratio).</div><div><br></div><div>Please give your suggestions.</div><div>Have a nice day.<br><div><div><br></div><div><br></div></div></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.Tech III Yr</div></div></div></div></div></div></div></div>

<br><div class="gmail_quote">On Thu, Mar 10, 2016 at 5:46 PM, James Aylett <span dir="ltr">&lt;<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:<br>

<br>

> I was not sharing it on the mailing list because I thought that someone<br>

> could use all the ideas I proposed in their GSoC proposal.<br>

<br>

</span>It's usually pretty obvious to us if someone has copied parts of<br>

someone else's proposal.<br>

<span class=""><br>

> The algorithm was not developed by me, but after much research<br>

> on various clustering techniques I found that there is a new<br>

> algorithm called CLUBS (Clustering Using Binary Splitting) which<br>

> gives better results than kmeans++ and hierarchical agglomerative<br>

> clustering.  It is faster and produces good results based on various<br>

> metrics of cluster quality.<br>

<br>

</span>I've only skimmed the paper for now, but it certainly looks<br>

interesting. Do you have a reason for picking TFIDF for feature<br>

extraction? Are there other approaches that might make sense? You may<br>

want to include in your project proposal how you intend to evaluate<br>

the speed and accuracy of the final clustering system.<br>

<br>

It sounds like you have a good handle on how you're going to go about<br>

implementing CLUBS in Xapian. Having a detailed plan in your proposal<br>

is a good way of demonstrating that you've thought through the<br>

practical aspect of adding a feature. When writing up your proposed<br>

timeline, remember to break things into small pieces -- no more than a<br>

week at most, and you'll probably find some that come in shorter than<br>

that.<br>

<div class="HOEnZb"><div class="h5"><br>

J<br>

<br>

--<br>

  James Aylett, occasional trouble-maker<br>

  <a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>

</div></div></blockquote></div><br></div>