<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 7, 2016 at 8:58 PM, James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:<br><br>

</span>Our GSoC guide has an application template<br>

<<a href="https://trac.xapian.org/wiki/GSoCApplicationTemplate" rel="noreferrer" target="_blank">https://trac.xapian.org/wiki/GSoCApplicationTemplate</a>> which you<br>

should use to structure your proposal. It has some recommendations on<br>

how you should lay out and think about a proposal, particularly in<br>

terms of planning your timeline, which in the past has proven one of<br>

the keys to a successful project.<br><div class=""><div class="h5"><br>

</div></div></blockquote></div>Thanks James!</div><div class="gmail_extra">Below I write a raw version of my proposal for Clustering of Search Results based on our previous mails.</div><div class="gmail_extra"><br></div><div class="gmail_extra">CLUSTERING OF SEARCH RESULTS:</div><div class="gmail_extra"><br></div><div class="gmail_extra">Motivation:</div><div class="gmail_extra">My main motivation for applying to this project is that clustering is a topic that I have always liked in machine learning and I have already worked with the different parts of the project, namely PSO and K-means, that I am going to propose. Thus, this project is a way to help bring my knowledge in these algorithms to life in terms of a real world application like Xapian. </div><div class="gmail_extra">Apart from this, effective clustering of documents is a very important feature in the effective retrieval and query processing speeds can be increased by using this feature.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Project Idea:</div><div class="gmail_extra">I would like to propose a hybrid PSO and K-means clustering algorithm for document clustering in Xapian.</div><div class="gmail_extra"><br></div><div class="gmail_extra">PSO stands for Particle Swarm Optimisation and is a technique which is used for finding the sub optimal solutions in a solution space.It is based on the way swarms of particles which are initially randomly oriented, will eventually follow the particle which has the best position relative to some fitness measure which is employed to find the better and best positions in the entire solution space.</div><div class="gmail_extra"><br></div><div class="gmail_extra">K-means is a partitioning clustering algorithm which is used to classify data into cluster based on various metrics. K-means algorithm starts with initially selecting any k random vectors in the search space and assigning them as cluster centroids. Then, every other vector is assigned to its closest cluster, which is found out by using the euclidian distance metric.This is continued till convergence.</div><div class="gmail_extra"><br></div><div class="gmail_extra">The only problem with K-means is selecting the initial clustering centroids because a certain initial configuration will always give the same result. Thus the problem of finding the most optimum clusters using K-means breaks down into finding an optimal initial configuration to start with to get the best clustering possible. The proposed approach will be independent of initial cluster centroids and can help K-means from being trapped in a local optimum, but use its fast convergence along with global search feature of PSO.</div><div class="gmail_extra"><br></div><div class="gmail_extra">The algorithm for the given apporach will be as follows:</div><div class="gmail_extra">1) Considering every document to be a particle, initialize each particle to contain K randomly selected

cluster centers, at the same time, randomly choose K

cluster centers for K-means algorithm. The process of K

cluster centers being chosen in PSO and K-means is

realized by randomly assigning each data vector to a

cluster and computing the cluster center.</div><div class="gmail_extra"><br></div><div class="gmail_extra">2) For each particle compute the fitness value according to the fitness equation</div><div class="gmail_extra"><br></div><div class="gmail_extra">3) Compare the i-th particle‟s fitness value with particle‟s

best solution pid_i, the better candidate solution

encountered by particle i, if current value is better than

pid_i, set pid_i equal to the current particle.</div><div class="gmail_extra"><br></div><div class="gmail_extra">4) Compare particle‟s fitness value with the population‟s

overall previous best pgd, the best candidate solution

encountered by all particles, if current value is better than

pgd, set pgd to the current particle‟s value and location.

</div><div class="gmail_extra"><br></div><div class="gmail_extra">5) At the same time, sse, the sum of distance between each

datavector and its cluster center in K-means algorithm is

computed</div><div class="gmail_extra"><br></div><div class="gmail_extra">6) Compare pgd and 1/sse; choose the big one as pgd.

</div><div class="gmail_extra"><br></div><div class="gmail_extra">7) Update the velocity and location of each particle using

the new pgd</div><div class="gmail_extra"><br></div><div class="gmail_extra">8) Assign each data vector to the closest cluster based on the

new location of each particle, and then recalculate the

cluster center</div><div class="gmail_extra"><br></div><div class="gmail_extra">9) Repeat above steps till maximum iterations is exceeded</div><div class="gmail_extra"><br></div><div class="gmail_extra">This approach helps K-means be independent of initial cluster centroids and helps find better quality clusters by combining PSO's ability to provide global search in a solution space and the fast convergence speed of K-means</div><div class="gmail_extra"><br></div><div class="gmail_extra">For implementing this, we will find the TermLists for every document and then pass this in a TermListGroup along with the docids to a function for the initialization step of the algorithm. In this step, we can create a vector of K-means centroids and every particle can contain a vector of cluster centers used by PSO.</div><div class="gmail_extra">A function will be written to calculate the best value for each particle and best value for the whole swarm using the term vectors for each document and the equations given in the paper on this algorithm.</div><div class="gmail_extra">A euclidian distance metric needs to be implemented in a similar way to the DocSimCosine class and with this, the we can find the closest cluster center and add it to the clusters. This can be done using a vector of centroids and vector of all other documents containing the docids. We can find the centroid a document is closest to an then assign it to the cluster using a struct similar to the ClusterAssignment struct used in the current implementation. In the next iteration, the new center can be found out from the current docs in each cluster and we can do this again.</div><div class="gmail_extra"><br></div><div class="gmail_extra">The link to the paper I referred to for proposing this approach is:</div><div class="gmail_extra"><a href="http://ictactjournals.in/paper/IJSCPaper_206-208.pdf">http://ictactjournals.in/paper/IJSCPaper_206-208.pdf</a><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Project Timeline:</div><div class="gmail_extra"><br></div><div class="gmail_extra">April 22 - May 22 (Before coding period starts) : Finish any more research required by talking to the mentors and any other resources. Have a clear idea of how to go about the project before the coding period</div><div class="gmail_extra"><br></div><div class="gmail_extra">May 23 - June 6 (Official coding period starts now) : </div><div class="gmail_extra">Define all the classes and functions that will be necessary for the implementation of the project</div><div class="gmail_extra"><br></div><div class="gmail_extra">June 7 - June 20 :</div><div class="gmail_extra">Implement and test the first phase which is a part dealing with the PSO algorithm<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">MID TERM EVAL:</div><div class="gmail_extra"><br></div><div class="gmail_extra">June 21 - June 27 :</div><div class="gmail_extra">Making further changes to the code to improve functionality, exception handling and bug removal </div><div class="gmail_extra"><br></div><div class="gmail_extra">June 28 - July 10 :</div><div class="gmail_extra">Implement and test the second phase which deals with the K-means being actually used for clustering.</div><div class="gmail_extra"><br></div><div class="gmail_extra">July 11 - July 16 :</div><div class="gmail_extra"><div class="gmail_extra">Making further changes to the code to improve functionality, exception handling and bug removal </div><div><br></div><div>July 17 - July 28 :</div><div>Implement or use a concrete measure to test the performance of the newly implemented clustering algorithm against the previously implemented one in terms of speed and accuracy</div><div><br></div><div>July 28 - August 1:</div><div>Find out drawbacks that may come out while testing the proposed solution and tackle it during this time by changing some parts of the code that may increase speed or efficiency.</div><div><br></div><div>August 2 - August 16:</div><div>Cleaning the written code, improving the documentation or writing documentation afresh, identifying any bugs that may come out and writing test before the final submissions</div><div><br></div><div>Im sorry if this is too long but it would be great about how I can improve on this and make it better in anyway or whether there would be any points that you would like me to elaborate in more detail.</div><div>Thanks :)</div></div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div></div>