<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div>Hello devs.<br><br></div>I would like to propose how I plan to improve the clustering branch during this GSoC and get it to a state where it can be integrated into Xapian.<br><br></div>I have identified three areas of work which were not touched last time.<br><br></div>1) Automated Performance Analysis<br></div>I had roughly implemented two evaluation techniques previously (distance between documents and centroids within clusters, and the silhouette coefficient), but I hadn't implemented them within Xapian, so it wasn't possible to automate the evaluation of the clustering results in the ClusterSet. It is therefore important to implement cluster evaluation techniques within a module (as a ClusterEvaluation class) so that users can pass in the ClusterSet (and the labels if necessary) and get output on how they can improve their clustering.<br><br></div>The cluster evaluation techniques that I would like to consider are:<br></div> a) Silhouette coefficient<br></div> b) Adjusted Rand Index<br></div> c) Fowlkes-Mallows index<br></div> d) F-measure<br></div> e) Homogeneity, completeness and V-measure<br><br></div>2) Dimensionality Reduction<br></div>Due to the high dimensionality of text documents, it is necessary to have at least one semantic dimensionality reduction technique. For this, I would like to implement Latent Semantic Analysis (LSA) to reduce the dimensionality of the input document vectors.<br><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">LSA
transforms the original data into a different space so that two
documents or words about the same concept are mapped close together (so that they
have a higher cosine similarity). LSA achieves this through Singular Value
Decomposition (SVD) of the term-document matrix.<br></span></span></div><div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">What I have found so far is that when we eliminate words that occur rarely in documents, the algorithm runs very fast. The main runtime problem stems from the document vectors ending up very high dimensional.<br></span></span></div><div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext"><br></span></span></div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">3) Hierarchical Clustering<br></span></span></div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.<br><br></span></span></div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">I have created a new PR so that the work done previously can be reviewed and merged as soon as possible, and I am currently trying to optimize the code and find the bottlenecks so that speed can be improved.<br><br></span></span></div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">It would be great to have feedback on what everyone thinks about this so that I can re-implement or improve things and make them better.<br><br></span></span></div><span class="gmail-inline_editor_value"><span class="gmail-rendered_qtext">Thanks. :)<br></span></span></div>
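<div>P.S. To make point 1 a bit more concrete, here is a rough sketch of the silhouette coefficient in plain Python. This is not Xapian code: the function names are made up for illustration, a list of lists of vectors stands in for the ClusterSet, Euclidean distance is only an assumption (cosine distance may suit document vectors better), and every cluster is assumed to be non-empty with at least two clusters in total.<br><br></div>

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_distance(point, others):
    """Mean distance from `point` to every vector in `others`."""
    return sum(euclidean(point, o) for o in others) / len(others)


def silhouette_coefficient(clusters):
    """Mean silhouette value over all points.

    `clusters` is a list of clusters, each a non-empty list of vectors.
    It is only a stand-in for a ClusterSet (hypothetical mapping).
    """
    scores = []
    for i, cluster in enumerate(clusters):
        for j, point in enumerate(cluster):
            if len(cluster) == 1:
                # Usual convention: a singleton cluster scores 0.
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            rest = cluster[:j] + cluster[j + 1:]
            a = mean_distance(point, rest)
            # b: mean distance to the nearest other cluster.
            b = min(mean_distance(point, other)
                    for k, other in enumerate(clusters) if k != i)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
good_score = silhouette_coefficient(good)
bad_score = silhouette_coefficient(bad)
```

<div>Values close to 1 mean tight, well separated clusters, while values below 0 suggest documents sitting in the wrong cluster, so the second grouping above scores worse than the first. A ClusterEvaluation class could report this number without needing any labels.<br></div>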