GSoC 2017 Project Proposal

Richhiey Thomas richhiey.thomas at
Mon Mar 20 05:47:49 GMT 2017

> This is definitely interesting. However it's not terribly worth having
> this unless/until we have more than one clustering system to evaluate, is
> it? (Beyond the uniform/random one, although I guess if any clusterer
> performs _worse_ than that, it's a bad sign!)

This module can still help us with the KMeans clusterer that is already
implemented, and since I would like to implement a hierarchical clusterer,
it could help with relative comparisons too.
Earlier, I was looking at both internal and external evaluation techniques.
But a typical use of this API will not provide ground-truth labels for the
documents, so internal evaluation techniques would be the better option. I
would therefore like to change my approach and introduce a few internal
clustering evaluation techniques:

1) Silhouette coefficient
2) Dunn Index
3) Root Mean Square Standard Deviation
4) Calinski-Harabasz index
5) Davies-Bouldin index
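To make this concrete, here is a minimal standalone sketch of the first of
these, the silhouette coefficient, computed over plain feature vectors.
The `mean_silhouette` helper and its point/label representation are my own
illustration for this mail, not part of the existing clustering API:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Euclidean distance between two feature vectors.
static double dist(const std::vector<double>& p, const std::vector<double>& q) {
    double s = 0;
    for (size_t i = 0; i < p.size(); ++i) s += (p[i] - q[i]) * (p[i] - q[i]);
    return std::sqrt(s);
}

// Mean silhouette coefficient over all points.
// points[i] is a feature vector; label[i] in [0, k) is its cluster id.
double mean_silhouette(const std::vector<std::vector<double>>& points,
                       const std::vector<int>& label, int k) {
    double total = 0;
    for (size_t i = 0; i < points.size(); ++i) {
        // Sum of distances from point i to each cluster, and cluster sizes
        // (excluding point i itself).
        std::vector<double> sum(k, 0);
        std::vector<int> cnt(k, 0);
        for (size_t j = 0; j < points.size(); ++j) {
            if (j == i) continue;
            sum[label[j]] += dist(points[i], points[j]);
            ++cnt[label[j]];
        }
        // a = cohesion: mean distance to points in the same cluster.
        double a = cnt[label[i]] ? sum[label[i]] / cnt[label[i]] : 0;
        // b = separation: mean distance to the nearest other cluster.
        double b = 1e300;
        for (int c = 0; c < k; ++c)
            if (c != label[i] && cnt[c]) b = std::min(b, sum[c] / cnt[c]);
        total += (b - a) / std::max(a, b);
    }
    return total / points.size();
}
```

Values close to 1 indicate tight, well-separated clusters; values near 0 or
below indicate overlapping or misassigned points.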

I aim to have this performance analysis module implemented by the end of
the community bonding period. Once it is in place, it will also be easier
to code and evaluate new clusterers, with minimal changes to the API.

> Do you think we'll need to implement several of these, for different uses?
> If not, is there a reason you think LSA will work best? You talk about
> eliminating words that occur rarely in documents — could we have a
> quick-and-dirty approach that looks at the within-corpus frequency of terms?

I did try the quick-and-dirty approach that you mentioned, but looking at
within-corpus frequency removes a lot of words that could otherwise add
meaning. It is also corpus-dependent, and hence a bad idea.
For now, removing stop words and the stemmed duplicates within the Document
has helped, but it would be better to add functionality for semantic
dimensionality reduction like LSA.
I think LSA would work best because, unlike various other methods, it is a
text-mining tool rather than a purely statistical one.
I'm not sure whether implementing more than one technique will be within
the scope of GSoC, but I see no harm in planning for more. We could create
a class like DimReduction and sub-class it to implement the various
techniques.
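As a rough sketch of the shape I have in mind (the `DimReduction` and
`transform` names are purely hypothetical, not part of the current API, and
the toy `TruncateReduction` subclass only stands in for a real technique;
an LSA subclass would instead compute a truncated SVD of the term-document
matrix):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical base class: maps a term-weight vector for a document to a
// lower-dimensional representation. Each technique becomes a subclass.
class DimReduction {
  public:
    virtual ~DimReduction() {}
    virtual std::vector<double> transform(const std::vector<double>& v) const = 0;
};

// Toy subclass for illustration only: keeps the first `dims` components.
// A real LSA subclass would project onto the top singular vectors instead.
class TruncateReduction : public DimReduction {
    size_t dims;
  public:
    explicit TruncateReduction(size_t d) : dims(d) {}
    std::vector<double> transform(const std::vector<double>& v) const override {
        return std::vector<double>(v.begin(),
                                   v.begin() + std::min(dims, v.size()));
    }
};
```

The clusterers would then only see the reduced vectors, so swapping in a
different reduction technique would not require changes to the clustering
code itself.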

> Do you have a particular approach you think is a good one? Are you
> thinking agglomerative or divisive?

I was thinking of agglomerative clustering, where we start from individual
clusters and work up to a single cluster containing all the documents.
This would be fairly simple to implement since we already have most of the
API in place; we would only need a way to merge two clusters while going up
the hierarchy tree.
So to start off, we can initialize one cluster per document using the
Cluster class, and then at each step merge the contents of two Cluster
objects into one and calculate the new cluster centroid, repeating until
only one cluster remains.

I will document all my findings from this conversation and my previous
ideas into a proposal and send it soon.


More information about the Xapian-devel mailing list