Introduction and Doubts

nirmal singhania nirmal.singhania at st.niituniversity.in
Thu Mar 10 19:51:14 GMT 2016


TF-IDF is the most widely used weighting scheme: it is easy to understand and
has been used in other frameworks such as Lucene and in many other places.
Okapi BM25 (implemented in Xapian) is theoretically a better, improved measure
than TF-IDF.

I am also looking into various other weighting schemes that are already in
Xapian or could be implemented, such as TF-ICF (term frequency-inverse corpus
frequency) and TF-RF (term frequency-relevance frequency).
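To make the difference between the two schemes concrete, here is a minimal
sketch of a TF-IDF weight and a BM25 weight for a single term in a single
document. It is an illustration only, not Xapian's BM25Weight code: the
function names, the parameter defaults k1=1.2 and b=0.75, and the "+1"
smoothed IDF variant are assumptions chosen for the example.

```python
import math


def tf_idf(tf, df, n_docs):
    """Classic TF-IDF: raw term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term (illustrative defaults, smoothed IDF)."""
    # Smoothed IDF; the +1 keeps the value positive for very common terms.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalisation: documents longer than average are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The tf component saturates towards (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    for tf in (1, 5, 10, 100):
        print(tf, tf_idf(tf, 10, 1000), bm25(tf, 10, 1000, 100, 100))
```

The TF-IDF weight grows linearly with the term frequency, while the BM25
weight flattens out and also takes document length into account, which is the
usual argument for BM25 being the better measure.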


To evaluate the speed and accuracy of the final clustering system, we can
benchmark it against other algorithms such as k-means and HAC, using the
measures mentioned in my previous mail (purity, F-measure, entropy, overall
similarity, relative margin, variance ratio).
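As a rough illustration of two of these measures, the sketch below computes
purity and entropy for a clustering when reference labels are available. The
function names and the toy data are made up for the example, and it assumes
each document has exactly one cluster id and one reference label.

```python
import math
from collections import Counter


def _label_counts(clusters, labels):
    """Group reference labels by cluster id: {cluster: Counter(label -> n)}."""
    counts = {}
    for cluster, label in zip(clusters, labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    counts = _label_counts(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the label entropy (in bits) of each cluster."""
    counts = _label_counts(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((n / size) * math.log2(n / size) for n in c.values())
        result += (size / total) * h
    return result


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]            # hypothetical cluster assignments
    labels = ["a", "a", "b", "b", "b", "b"]  # hypothetical reference labels
    print(purity(clusters, labels), entropy(clusters, labels))
```

Higher purity and lower entropy indicate clusters that agree better with the
reference labels, so the same two functions can be run on the output of each
algorithm being compared.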

Please give your suggestions.
Have a nice day.



Regards,
Nirmal Singhania
B.tech III Yr

On Thu, Mar 10, 2016 at 5:46 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:
>
> > I was not sharing it on the mailing list because I thought that someone
> > could use all the ideas I proposed in their GSoC proposal.
>
> It's usually pretty obvious to us if someone has copied parts of
> someone else's proposal.
>
> > The algorithm was not developed by me, but after much research
> > on various clustering techniques I found that there is a new
> > algorithm called CLUBS (Clustering Using Binary Splitting) which
> > gives better results than k-means++ and hierarchical agglomerative
> > clustering.  It is faster and produces good results based on various
> > metrics of cluster quality.
>
> I've only skimmed the paper for now, but it certainly looks
> interesting. Do you have a reason for picking TFIDF for feature
> extraction? Are there other approaches that might make sense? You may
> want to include in your project proposal how you intend to evaluate
> the speed and accuracy of the final clustering system.
>
> It sounds like you have a good handle on how you're going to go about
> implementing CLUBS in Xapian. Having a detailed plan in your proposal
> is a good way of demonstrating that you've thought through the
> practical aspect of adding a feature. When writing up your proposed
> timeline, remember to break things into small pieces -- no more than a
> week at most, and you'll probably find some that come in shorter than
> that.
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>


More information about the Xapian-devel mailing list