[Xapian-devel] GSOC 2015 Participation | Ganesh Prabu

Sun Feb 15 11:54:40 GMT 2015

Hi Developers,

I am Ganesh Prabu, a final-year computer science student at SASTRA
University, India. I read through the project ideas page and found
Clustering of Search Results to be the one that best fits my profile.
Before proceeding further, I will briefly introduce myself and my
programming background.

About:

I have excellent algorithmic skills and a good grasp of object-oriented
design patterns. I did my internship at KLA-Tencor, where I worked on
projects involving multithreading in C# and C++, so I have about five
months of industry experience. I have implemented data mining algorithms
as part of my coursework, and I have used CUDA to generate Mandelbrot and
Julia sets. I am good at benchmarking and always look for ways to improve
an implementation.

I have also done several projects, including a Chain Reaction game
(JavaScript) and an AI Snake. I won first place in an intra-college
competition conducted by Microsoft, and won Raspberry Pi kits from
KLA-Tencor for developing an OMR reader. I also participate in CodeChef
and HackerRank to sharpen my algorithmic skills. Here are my LinkedIn and
GitHub accounts:

https://www.linkedin.com/in/ganeshpraburavi

https://github.com/ganeshpraburavi

I have started reading through the existing code, which implements the
K-Means algorithm with a similarity measure based on TF-IDF weights.
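
To make the setup described above concrete, here is a minimal sketch of
TF-IDF weighting combined with cosine similarity, which is the kind of
document comparison K-Means relies on. It is written in Python for brevity
and is not taken from the Xapian code; the function names tfidf_vectors and
cosine, the tokenisation by whitespace, and the sample documents are all
assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = raw term frequency * log inverse document frequency.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, tokenised by whitespace.
docs = [
    "xapian search engine library".split(),
    "xapian search results clustering".split(),
    "julia set fractal rendering".split(),
]
vecs = tfidf_vectors(docs)
```

With these sample documents, the first two share the terms "xapian" and
"search" and so score higher against each other than either does against
the third, which shares no terms with them. Note that every distinct term
becomes its own dimension, which is why the feature space grows quickly.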

Problems in the Existing Method:

1. No dimensionality reduction is done, so the feature space is large.

2. No effort is made at feature selection. Even if the code ran
successfully, it would have produced poor clusters.

Solution:

1. Apply a dimensionality reduction technique (DRT) that both reduces the
number of features and selects the most relevant ones. [1]

2. Implement a parallel clustering algorithm such as Buckshot, Suffix Tree
Clustering, or Lingo. These clustering algorithms are better suited to Web
documents. [2]

*Note: Lingo is an algorithm used in Carrot2 for clustering search results
from Lucene and Solr.
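
As a rough illustration of the first point in the solution, the sketch below
shows one simple form of feature selection: document-frequency thresholding,
which drops terms that are too rare or too common to help separate clusters.
This is only one possible technique and is not necessarily the method
evaluated in [1]; the function names select_terms and project, the
thresholds min_df and max_df_ratio, and the sample documents are assumptions
made for illustration.

```python
from collections import Counter


def select_terms(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that occur in at least min_df documents and in at most
    max_df_ratio of all documents."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}


def project(doc, vocabulary):
    """Drop every token of a document that is not in the kept vocabulary."""
    return [t for t in doc if t in vocabulary]


# Hypothetical sample documents, tokenised by whitespace.
docs = [
    "the xapian search engine".split(),
    "the xapian clustering code".split(),
    "the search results page".split(),
    "the julia set fractal".split(),
    "the mandelbrot set fractal".split(),
]
kept = select_terms(docs)
reduced = [project(doc, kept) for doc in docs]
```

Here "the" is dropped because it appears in every document, and terms that
appear only once are dropped as too rare, so the vocabulary shrinks from
twelve distinct terms to four before any clustering is attempted.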

I have yet to work out the exact method for solving this problem. Is the
idea of using a parallel programming approach acceptable? I would love to
discuss how to proceed further.

I am very excited about this project and would be very glad to work on it
with full dedication, completing each specified task before its deadline.

[1] https://web.cs.dal.ca/~luo/AI2005.pdf

[2] http://project.carrot2.org/publications/wroblewski-2003-ahc.pdf

-- 
Thanks
Ganesh Prabu