GSoC 2017 Project Proposal

Richhiey Thomas richhiey.thomas at gmail.com
Thu Mar 9 05:18:58 GMT 2017


Hello devs.

I would like to outline how I plan to improve the clustering branch during
this GSoC, so that it ends up as a system that can be integrated into
Xapian.

I have identified three areas of work which were not touched last time.

1) Automated Performance Analysis
Previously I had roughly implemented two evaluation techniques (the
distance between documents and centroids within clusters, and the
Silhouette coefficient), but not within Xapian, so it wasn't possible to
automate evaluating the clustering results in a ClusterSet. It is therefore
important to implement cluster evaluation techniques within a module (as a
ClusterEvaluation class), so that users can pass in the ClusterSet (and the
labels if necessary) and get output on how they can improve their
clustering.
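
To make the internal (label-free) kind of evaluation concrete, here is a
minimal sketch of the Silhouette coefficient. It is plain Python rather
than Xapian code, and the function names, the tuple-based points and the
Euclidean distance are all assumptions made for illustration, not the
proposed ClusterEvaluation interface.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels, distance=euclidean):
    """Mean silhouette over all points; labels[i] is the cluster of points[i].

    Hypothetical helper for illustration only (not part of Xapian).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


POINTS = [(0, 0), (0, 1), (10, 10), (10, 11)]
GOOD = silhouette_score(POINTS, [0, 0, 1, 1])
BAD = silhouette_score(POINTS, [0, 1, 0, 1])
```

Values close to 1 suggest compact, well separated clusters, while values
at or below 0 suggest that documents sit in the wrong cluster, which is
the kind of feedback the evaluation module is meant to give.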

The cluster evaluation techniques that I would like to consider are:
      a) Silhouette coefficient
      b) Adjusted Rand Index
      c) Fowlkes-Mallows index
      d) F-measure
      e) Homogeneity, Completeness and V-measure
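
Most of the measures in this list compare the clustering against
reference labels. As one example, the Adjusted Rand Index can be sketched
from pair counts as below. This is illustrative Python with hypothetical
names, assuming the reference labels and the cluster assignments are
given as two parallel lists.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index between two labelings of the same documents.

    Hypothetical helper for illustration only (not part of Xapian).
    """
    if len(truth) != len(predicted):
        raise ValueError("both labelings must cover the same documents")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two documents")

    # Pairs of documents that share a cell of the contingency table,
    # a reference class, or a predicted cluster respectively.
    same_cell = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    same_class = sum(comb(c, 2) for c in Counter(truth).values())
    same_cluster = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = same_class * same_cluster / comb(n, 2)  # agreement by chance
    maximum = (same_class + same_cluster) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in a single group
    return (same_cell - expected) / (maximum - expected)


PERFECT = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
CROSSED = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score does not depend on how the clusters are numbered: a perfect
match gives 1, and values near or below 0 mean the agreement is no better
than chance.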

2) Dimensionality Reduction
Because text documents are high dimensional, it is necessary to have at
least one semantic dimensionality reduction technique. For this, I would
like to implement Latent Semantic Analysis (LSA) to reduce the
dimensionality of the input document vectors.
LSA transforms the original data into a different space in which two
documents/words about the same concept are mapped close together (so that
they have higher cosine similarity). LSA achieves this by Singular Value
Decomposition (SVD) of the term-document matrix.
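
A toy sketch of the idea, in plain Python: the top k singular directions
of the term-document matrix A are obtained here from the eigenvectors of
A^T A using power iteration with deflation. That method is only a
stand-in chosen to keep the example dependency-free (a real
implementation would use a proper sparse SVD routine), and the matrix,
the names and k = 2 are all made up for illustration.

```python
import math


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for idx in range(k):
        vec = [1.0 + 0.1 * ((i + idx) % n) for i in range(n)]  # fixed start
        for _ in range(iters):
            nxt = matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        value = sum(a * b for a, b in zip(vec, matvec(work, vec)))
        pairs.append((value, vec))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsa_document_vectors(term_doc, k):
    """k-dimensional document coordinates (rows of V_k * Sigma_k)."""
    n_docs = len(term_doc[0])
    gram = [[sum(row[i] * row[j] for row in term_doc) for j in range(n_docs)]
            for i in range(n_docs)]  # A^T A, one row/column per document
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(value, 0.0)) * vec[d] for value, vec in pairs]
            for d in range(n_docs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Rows are terms (cat, kitten, pet, car, engine); columns are documents.
TERM_DOC = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
RAW = [[row[d] for row in TERM_DOC] for d in range(4)]
REDUCED = lsa_document_vectors(TERM_DOC, 2)
```

In this toy data, documents 0 and 1 share only the term "pet", so their
raw cosine similarity is moderate, but in the reduced space they end up
pointing in almost the same direction while staying apart from the two
documents about cars.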
What I have found so far is that when we eliminate words that occur
rarely in documents, the algorithm runs very fast. The main runtime
problem stems from document vectors ending up very high dimensional.
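
The rare-word filtering mentioned above could look roughly like the
following sketch. It is illustrative Python, assuming each document is a
dict mapping terms to weights; the function name and the min_doc_freq
threshold are hypothetical, and a sensible cutoff would have to be chosen
by experiment.

```python
from collections import Counter


def prune_rare_terms(documents, min_doc_freq=2):
    """Drop terms that occur in fewer than min_doc_freq documents.

    documents is a list of {term: weight} dicts. Hypothetical helper for
    illustration only (not part of Xapian).
    """
    # Each dict has unique keys, so this counts documents per term.
    doc_freq = Counter(term for doc in documents for term in doc)
    keep = {term for term, freq in doc_freq.items() if freq >= min_doc_freq}
    return [{t: w for t, w in doc.items() if t in keep} for doc in documents]


DOCS = [
    {"xapian": 2.0, "cluster": 1.0, "typo123": 1.0},
    {"xapian": 1.0, "search": 1.0},
    {"cluster": 3.0, "search": 2.0, "oneoff": 1.0},
]
PRUNED = prune_rare_terms(DOCS, min_doc_freq=2)
```

Every term removed this way is one dimension fewer in each document
vector that the clusterer has to compare.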

3) Hierarchical Clustering
Since the clustering API is already in place, I would like to implement a
hierarchical clusterer to cluster the search results.
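
As a rough illustration of what a hierarchical clusterer does, here is a
minimal bottom-up (agglomerative) sketch in plain Python. The average
linkage, the Euclidean distance and all the names are assumptions made
for the example; the proposal does not fix which variant would be
implemented, and this naive version is far too slow for real result sets.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, left, right):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in left for j in right)
    return total / (len(left) * len(right))


def agglomerative_cluster(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Hypothetical helper for illustration only (not part of Xapian).
    """
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]  # start: one per document
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = average_linkage(points, clusters[a], clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


POINTS = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
CLUSTERS = agglomerative_cluster(POINTS, 3)
```

Unlike a flat clusterer, the sequence of merges forms a tree, so the same
run can be cut at different levels to give coarser or finer groupings of
the search results.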

Currently, I have created a new PR so that the work done previously can be
reviewed and merged as soon as possible, and I am trying to optimize the
code and find the bottlenecks so that speed can be improved.

It would be great to have feedback on what everyone thinks about this so
that I can re-implement or improve things and make them better.

Thanks. :)