GSOC-2016 Project : Clustering of search results

Mon Mar 14 11:42:15 GMT 2016

On Mon, Mar 14, 2016 at 02:09:13AM +0530, Richhiey Thomas wrote:

> The way the paper has been written I guess is the main source of your
> confusion. Let me provide a paper that explains this same concept in a way
> that is easier to understand. I was confused by eq (3) that you mentioned
> too. Here it is :
> http://www.sau.ac.in/~vivek/softcomp/clustering%20PSO+K-means.pdf

Ah, that's helpful -- thanks!

> The whole point of this approach is to use global search of PSO and fast
> convergence of K-means together.
> So we take every particle and assign it k randomly initialized cluster
> centroids.

>From the above paper, it sounds like they are randomly chosen
documents as the initial centroids.

> For every particle, we consider the cluster centroids that we have assigned
> to that particle and assign all other documents to the closest of these
> cluster centroids.

This is one of the details that I think escaped me when reading the
first paper. It's much clearer in the above paper.

> Through this, we can obtain a fitness value which is
> calculated by finding the summation of distance between all the documents
> and all cluster centroids and taking its reciprocal or making some other
> fitness function that we can define to suit or needs better. Using this
> fitness function, the next location can be found for the particle.

By moving each centroid for each particle, using the PSO iteration
equation?

> > As I understand it, though, there are still some parameters to this
> > algorithm: the maximum iterations, the inertia weight and two
> > 'gravity' constants for the velocity calculation in (1), and K. Is
> > that correct? Is there anything we can do to predict a useful value
> > for any of them?
> 
> I dont know of any ways to predict an optimum value for any of these
> parameters though I'm sure there will be some mathematical way to do so.
> But I think since we know how each of these parameters would affect the
> clustering, we can find the correct values that work well when we test this
> algorithm against the kinds of datasets this algorithm will deal with later
> on.

If there are good default values that will work for most people, that
would be fine (that's what we do for the parameters to BM25, for
instance; in more extreme situations you have to change from the
defaults, but most of the time you can just accept them and get
reasonable behaviour). If the documentation can give an idea of when
you'd want to change from the defaults, that would be even better.

> > > May 23 - June 6 (Official coding period starts now) :
> > > Define all the classes and functions that will be necessary for the
> > > implementation of the project
>
> I will be having my university exams in this period and hence, the
> amount of time I will be able to devote to the project will be kind
> of restricted during these two weeks. That's the only reason I let
> this part go slow. It still shouldn't take that much time, but just
> incase it does, the project wont lag behind.

Okay, that's obviously fine -- just note that in your timeline
alongside what you want to get done during that period. (Which might
be nothing. There's some benefit to just concentrating on your exams
and coming back fresh afterwards!)

> To break things down further, we can look at things like this
> June 7-June 10
> Create a class to represent particles and write code to initialize each
> particle in the dataset with cluster-centroids
> June 11 - June 16:
> Implement code to find the fitness value for each particle and finding the
> new location of the particle.
> June 17-June 20:
> Test the code against any dataset which contains documents so that we can
> see how the code will be finding the initial centroids and so that we can
> tweak the initial parameters to get good performance.

It'd be good to break the tests into two groups: one very fine
grained, testing individual pieces of functionality (such as the
fitness calculation, and calculating the new particle location); one
higher level, testing the overall algorithm. Depending on how your
classes work, the fine grained tests may not need to know about
Xapian::Document, which might make them easier to write.

Either way, I'd strongly recommend writing the fine grained tests
either before or at the same time as the code they test.

> Improve functionality was just meant to say that incase there were portions
> that werent implemented completely during the above phase, it can be
> completed now. If not, this period can be completely devoted to writing
> test cases, or taking care of exception handling and bug removal, to make
> sure the code written is in good shape,

And again: if you write the tests as you go along, you don't need to
think like this. Writing tests afterwards tends to result in less good
tests, because you'll believe that the code already works. Writing the
tests earlier means you're using the tests to tell you when the code
works, ie when the work is complete.

It's fine to have some time in the timeline for catching up and making
sure everything's fine, though. I'm just trying to nudge you towards
writing tests early rather than late :-)

> June 28 - July 4:
> Using cluster centroid vectors from PSO and implement assigning documents
> to the closest centroids.

Surely you'll have a way of assigning documents to the closest
centroids in a particle from the PSO work earlier? So carrying the
best particle across as the initial centroid set for K-means should be
fairly straightforward.

> July 4 - July 12:
> Reassigning cluster centroids and finishing up whatever is left in the
> complete implementation of the algorithtm
>
> I think this period will be essential for writing some test cases,
> concentrating on making sure no bugs creep in and on keeping the code
> indented and clean and exception free.

On things like code indentation, you may want to look at
clang-format. It may take a little while to get a format definition
file that matches Xapian's code layout, but you can then run it
regularly so you aren't spending ages fixing things up later. (The
sooner you fix any indentation issues, the faster you'll get familiar
with indenting things in the style of Xapian.)

(We don't have a clang-format file at the moment, but if you were to
make one as part of this project, we could include it for people in
future.)

> I've tried to breakdown the timeline out here but I'm sure that this is
> still not perfect enough. So I'd love it if you could tell me where I am
> going wrong and what more detail I could provide so I can think about it
> and get back to you.

What you have now is a lot better than your previous timeline --
hopefully you feel that it's more structured as well. I think at this
point the best move is to update your draft, and (in a few hours) open
an application on the GSoC website, at which point we can discuss
things further.

> Also, as of now I'm not able to think of any stretch goals with the project
> but I'll let you know if I come up with something worth the while!

If you don't, I wouldn't worry about it too much. You could spend more
time evaluating options (including maybe looking at cosine instead of
euclidian distance), which would result in a more comprehensive
explanation of the options for future users. And you may discover
other ideas while actually doing the project.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org