Introduction and Doubts

James Aylett james-xapian at tartarus.org
Thu Mar 10 12:16:19 GMT 2016


On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:

> I was not sharing it on the mailing list because I thought that someone
> could use all the ideas I proposed in their GSoC proposal.

It's usually pretty obvious to us if someone has copied parts of
someone else's proposal.

> The algorithm was not developed by me, but after much research on
> various clustering techniques I found that there is a new
> algorithm called CLUBS (Clustering Using Binary Splitting) which
> gives better results than kmeans++ and hierarchical agglomerative
> clustering.  It is faster and produces good results based on various
> metrics of cluster quality.

I've only skimmed the paper for now, but it certainly looks
interesting. Do you have a reason for picking TFIDF for feature
extraction? Are there other approaches that might make sense? You may
want to include in your project proposal how you intend to evaluate
the speed and accuracy of the final clustering system.

It sounds like you have a good handle on how you're going to go about
implementing CLUBS in Xapian. Having a detailed plan in your proposal
is a good way of demonstrating that you've thought through the
practical aspect of adding a feature. When writing up your proposed
timeline, remember to break things into small pieces -- no more than a
week each, and you'll probably find some that come in shorter than
that.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org


