GSOC-2016 Project : Clustering of search results

Richhiey Thomas richhiey.thomas at gmail.com
Sat Mar 12 10:57:55 GMT 2016


On Mon, Mar 7, 2016 at 8:58 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:
>
> Our GSoC guide has an application template
> <https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you
> should use to structure your proposal. It has some recommendations on
> how you should lay out and think about a proposal, particularly in
> terms of planning your timeline, which in the past has proven one of
> the keys to a successful project.
>
Thanks James!

Below is a rough version of my proposal for Clustering of Search Results,
based on our previous mails.

CLUSTERING OF SEARCH RESULTS:

Motivation:
My main motivation for applying to this project is that clustering is a
topic I have always liked in machine learning, and I have already worked
with the two parts of the proposed project, namely PSO and K-means. This
project is thus a way to bring my knowledge of these algorithms to life in
a real-world application like Xapian.
Apart from this, effective clustering of documents is an important feature
for effective retrieval, and query processing speed can be increased by
using it.

Project Idea:
I would like to propose a hybrid PSO and K-means clustering algorithm for
document clustering in Xapian.

PSO stands for Particle Swarm Optimisation and is a technique used for
finding near-optimal solutions in a solution space. It is based on the way
a swarm of particles, initially positioned at random, gradually follows the
particle with the best position according to some fitness measure, and in
doing so discovers better and better positions across the solution space.
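
To make the update rule concrete, here is a minimal sketch of the standard
PSO velocity and position update for one particle (plain C++, not Xapian
code; w, c1 and c2 are the usual inertia and acceleration coefficients, and
the values below are just common defaults):

#include <cstdlib>
#include <vector>

// Standard PSO update for one particle.
// x: current position, v: velocity, pbest: this particle's best position so
// far, gbest: the swarm's best position so far.
void pso_update(std::vector<double>& x, std::vector<double>& v,
                const std::vector<double>& pbest,
                const std::vector<double>& gbest,
                double w = 0.72, double c1 = 1.49, double c2 = 1.49)
{
    for (size_t d = 0; d < x.size(); ++d) {
        double r1 = std::rand() / double(RAND_MAX);
        double r2 = std::rand() / double(RAND_MAX);
        // Pull each dimension towards the personal best and the swarm best.
        v[d] = w * v[d] + c1 * r1 * (pbest[d] - x[d])
                        + c2 * r2 * (gbest[d] - x[d]);
        x[d] += v[d];
    }
}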

K-means is a partitional clustering algorithm used to group data into
clusters based on a distance metric. The algorithm starts by selecting K
random vectors in the search space and taking them as the initial cluster
centroids. Every other vector is then assigned to its closest centroid,
found using the Euclidean distance metric, and the centroids are recomputed
from their assigned vectors; this is repeated until convergence.
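
As an illustration, one assignment-and-update iteration over dense vectors
could look like the following sketch (plain C++, independent of any Xapian
class; the caller is assumed to have sized the assignment vector to the
number of points):

#include <limits>
#include <vector>

// One K-means iteration: assign each point to its nearest centroid by
// (squared) Euclidean distance, then recompute each centroid as the mean of
// the points assigned to it.
void kmeans_iteration(const std::vector<std::vector<double>>& points,
                      std::vector<std::vector<double>>& centroids,
                      std::vector<size_t>& assignment)
{
    const size_t K = centroids.size();
    const size_t dims = centroids[0].size();

    // Assignment step.
    for (size_t i = 0; i < points.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (size_t k = 0; k < K; ++k) {
            double dist = 0.0;
            for (size_t d = 0; d < dims; ++d) {
                double diff = points[i][d] - centroids[k][d];
                dist += diff * diff;
            }
            if (dist < best) {
                best = dist;
                assignment[i] = k;
            }
        }
    }

    // Update step: recompute each centroid as the mean of its points.
    std::vector<std::vector<double>> sums(K, std::vector<double>(dims, 0.0));
    std::vector<size_t> counts(K, 0);
    for (size_t i = 0; i < points.size(); ++i) {
        for (size_t d = 0; d < dims; ++d)
            sums[assignment[i]][d] += points[i][d];
        ++counts[assignment[i]];
    }
    for (size_t k = 0; k < K; ++k)
        if (counts[k])
            for (size_t d = 0; d < dims; ++d)
                centroids[k][d] = sums[k][d] / counts[k];
}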

The main problem with K-means is the choice of the initial cluster
centroids, because a given initial configuration will always converge to
the same result, which may only be a local optimum. The problem of finding
the best clusters with K-means thus breaks down into finding an initial
configuration that leads to the best clustering possible. The proposed
approach is independent of the initial cluster centroids and can keep
K-means from being trapped in a local optimum, while still using its fast
convergence along with the global search ability of PSO.

The algorithm for the proposed approach is as follows (a rough code sketch
of the main loop is given after the summary below):
1) Initialize each particle to contain K randomly selected cluster centers,
and at the same time randomly choose K cluster centers for the K-means
algorithm. In both cases the K centers are obtained by randomly assigning
each data vector to a cluster and computing the resulting cluster centers.

2) For each particle, compute the fitness value according to the fitness
equation given in the paper.

3) Compare the i-th particle's fitness value with pid_i, the best candidate
solution encountered by particle i so far; if the current value is better
than pid_i, set pid_i to the current particle's position.

4) Compare the particle's fitness value with pgd, the best candidate
solution encountered by the whole population so far; if the current value
is better than pgd, set pgd to the current particle's value and location.

5) At the same time, compute sse, the sum of the distances between each
data vector and its cluster center in the K-means algorithm.

6) Compare pgd with 1/sse and keep the larger of the two as pgd.

7) Update the velocity and location of each particle using the new pgd.

8) Assign each data vector to the closest cluster based on the new location
of each particle, and then recalculate the cluster centers.

9) Repeat steps 2 to 8 until the maximum number of iterations is reached.

This approach makes K-means independent of the initial cluster centroids
and helps find better quality clusters by combining PSO's ability to search
the solution space globally with the fast convergence of K-means.
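
To show how the steps fit together, below is a rough sketch of one outer
iteration of the hybrid loop (again plain C++ and only an illustration: the
Particle struct and helper names are my own, and I am assuming the fitness
can be taken as 1/sse so that the comparison in step 6 makes sense; the
exact fitness equation is the one in the paper). It reuses the
kmeans_iteration() function sketched above.

#include <algorithm>
#include <cstdlib>
#include <limits>
#include <vector>

// From the K-means sketch above.
void kmeans_iteration(const std::vector<std::vector<double>>& points,
                      std::vector<std::vector<double>>& centroids,
                      std::vector<size_t>& assignment);

// A particle's position is a set of K candidate cluster centers.
struct Particle {
    std::vector<std::vector<double>> centers;   // position
    std::vector<std::vector<double>> velocity;  // same shape as centers
    std::vector<std::vector<double>> pbest;     // personal best (pid)
    double pbest_fitness = 0.0;
};

// Squared Euclidean distance between two dense vectors.
static double sqdist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Sum of distances from each document to its nearest center (the sse of
// steps 5 and 6).
static double sse(const std::vector<std::vector<double>>& docs,
                  const std::vector<std::vector<double>>& centers)
{
    double total = 0.0;
    for (const auto& doc : docs) {
        double best = std::numeric_limits<double>::max();
        for (const auto& c : centers)
            best = std::min(best, sqdist(doc, c));
        total += best;
    }
    return total;
}

// One outer iteration (steps 2 to 8).  gbest and gbest_fitness are kept by
// the caller across iterations, with gbest_fitness starting at 0.
void hybrid_iteration(std::vector<Particle>& swarm,
                      std::vector<std::vector<double>>& kmeans_centers,
                      const std::vector<std::vector<double>>& docs,
                      std::vector<std::vector<double>>& gbest,
                      double& gbest_fitness)
{
    for (auto& p : swarm) {
        double f = 1.0 / sse(docs, p.centers);        // step 2 (assumed form)
        if (f > p.pbest_fitness) {                    // step 3
            p.pbest_fitness = f;
            p.pbest = p.centers;
        }
        if (f > gbest_fitness) {                      // step 4
            gbest_fitness = f;
            gbest = p.centers;
        }
    }

    std::vector<size_t> assignment(docs.size(), 0);
    kmeans_iteration(docs, kmeans_centers, assignment);   // step 5
    double kf = 1.0 / sse(docs, kmeans_centers);          // step 6
    if (kf > gbest_fitness) {
        gbest_fitness = kf;
        gbest = kmeans_centers;
    }

    // Steps 7 and 8: move each particle towards its personal best and the
    // swarm best; the caller then reassigns documents to the new centers.
    const double w = 0.72, c1 = 1.49, c2 = 1.49;
    for (auto& p : swarm) {
        for (size_t k = 0; k < p.centers.size(); ++k) {
            for (size_t d = 0; d < p.centers[k].size(); ++d) {
                double r1 = std::rand() / double(RAND_MAX);
                double r2 = std::rand() / double(RAND_MAX);
                p.velocity[k][d] = w * p.velocity[k][d]
                    + c1 * r1 * (p.pbest[k][d] - p.centers[k][d])
                    + c2 * r2 * (gbest[k][d] - p.centers[k][d]);
                p.centers[k][d] += p.velocity[k][d];
            }
        }
    }
}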

For implementing this, we will find the TermLists for every document and
then pass them in a TermListGroup along with the docids to a function for
the initialization step of the algorithm. In this step, we can create a
vector of K-means centroids, and every particle can contain a vector of the
cluster centers used by PSO.
A function will be written to calculate the best value for each particle
and the best value for the whole swarm, using the term vectors of each
document and the equations given in the paper on this algorithm.
A Euclidean distance metric needs to be implemented in a similar way to the
DocSimCosine class; with this, we can find the closest cluster center for
each document and add the document to that cluster. This can be done using
a vector of centroids and a vector of all the other documents containing
the docids. We can find the centroid a document is closest to and then
assign the document to that cluster using a struct similar to the
ClusterAssignment struct used in the current implementation. In the next
iteration, the new centers can be computed from the documents currently in
each cluster, and we can repeat the process.
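
As a rough illustration of that distance metric (not existing Xapian API;
representing a term vector as a std::map from term to weight is just an
assumption for the sketch), Euclidean distance over sparse term vectors
could look like this:

#include <cmath>
#include <map>
#include <string>

// Euclidean distance between two sparse term-weight vectors, where a
// document or centroid is represented as a map from term to weight.  Terms
// missing from one of the vectors are treated as having weight 0.
double euclidean_distance(const std::map<std::string, double>& a,
                          const std::map<std::string, double>& b)
{
    double sum = 0.0;
    for (const auto& t : a) {
        auto it = b.find(t.first);
        double diff = t.second - (it == b.end() ? 0.0 : it->second);
        sum += diff * diff;
    }
    // Add the contribution of terms that appear only in b.
    for (const auto& t : b) {
        if (a.find(t.first) == a.end())
            sum += t.second * t.second;
    }
    return std::sqrt(sum);
}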

The link to the paper I referred to for proposing this approach is:
http://ictactjournals.in/paper/IJSCPaper_206-208.pdf

Project Timeline:

April 22 - May 22 (before the coding period starts) : Finish any remaining
research by talking to the mentors and consulting other resources, and have
a clear idea of how to go about the project before the coding period begins.

May 23 - June 6 (Official coding period starts now) :
Define all the classes and functions that will be necessary for the
implementation of the project

June 7 - June 20 :
Implement and test the first phase, which deals with the PSO part of the
algorithm.

MID TERM EVAL:

June 21 - June 27 :
Make further changes to the code to improve functionality, add exception
handling and fix bugs.

June 28 - July 10 :
Implement and test the second phase, which deals with K-means actually
being used for clustering.

July 11 - July 16 :
Make further changes to the code to improve functionality, add exception
handling and fix bugs.

July 17 - July 28 :
Implement or use a concrete measure to evaluate the performance of the
newly implemented clustering algorithm against the previously implemented
one in terms of speed and accuracy.

July 28 - August 1:
Find any drawbacks that come out while testing the proposed solution and
tackle them during this time by changing parts of the code to improve speed
or efficiency.

August 2 - August 16:
Clean up the written code, improve the documentation or write documentation
afresh, identify any remaining bugs and write tests before the final
submission.

I'm sorry if this is too long, but it would be great to hear how I can
improve on this and make it better in any way, or whether there are any
points you would like me to elaborate on in more detail.
Thanks :)