Introduction and Doubts

Thu Mar 10 00:17:29 GMT 2016

I was not sharing it on maling list because i thought that someone can use
all ideas i proposed in their GSOC proposal.

Surely i will contribute to xapian project.
sorry if that was against the rules
The algorithm is not developed by me but after having much research on
various clustering techniques.
I found that there is a new algorithm called CLUBS(Clustering Using Binary
Splitting) which gives better results than kmeans++ and hierarchical
agglomerative clustering.
It is faster and produces good results based on various metrics of cluster
quality.

the algorithm works in following way
The first phase of the algorithm is
divisive, as the original data set(in this case, set of search documents to
cluster) is split recursively into miniclusters through successive
binary splits: the algorithm’s second phase is agglomerative since these
miniclusters
are recombined into the final result. Further, the approach induces during
execution a dynamic
hierarchical grid that will better fit the dataset w.r.t. classical grid
approaches that exploit
a fixed grid instead. Finally, the algorithm exploits the analytical
properties of the
Quadratic Sums of Squares (SSQ in the following) function to minimize the
cost of
merge and split operations.

you can refer the paper for more details about time complexity and
performance
http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf

for implementing it,we can use Documentsource class in our previous
clustering approach and create a binary tree
and perform and topdownsplitting and then bottomup merging.
First we have to implement feature extraction from text document(TFIDF
would be a good choice) which is implemented in xapian weighting schemes.
Then we will implement function to compute distances between documents
based on normalized TF-IDF Matrix.
Based on distances we will initially assign cluster and improve on it
using topdownsplitting
and then bottomup merging.

We will compute SSQ of a cluster from all points in a cluster. and then
Improve our clustering based on SSQ measures.

The functions for topdownsplitting  will include computeAverageDeltaSSQ to
average the SSQ of all points in cluster.
The functions for bottomupmerging will include computeweightedSSQ.
.

Please Give your valuable suggestions.
Have a Nice Day.

Regards,
Nirmal Singhania
B.tech III Yr

On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote:
>
> > based on my some time of Research,I have in mind a clustering algorithm
> > that can overcome Quality issues of K-means(and its variants) and Speed
> > Issues of Hierarchical Agglomerative Clustering.
> > Theoretically it can work O(n) and Can produce results better than HAC
> > based on various metrics.
> > I can't discuss it on mailing-list but you say we can discuss more about
> it
> > and its implementation in xapian in PM.
>
> Hi, Nirmal -- welcome to Xapian! What do you mean when you say that
> you can't discuss it on the mailing list? If you can't discuss it in
> our public forums, then you aren't going to be able to contribute that
> work to Xapian, are you? (Since we'll need it documented clearly,
> which will likely include useful details of that discussion.)
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160310/5ac9af7f/attachment-0001.html>