GSOC-2016 Project : Clustering of search results

James Aylett james-xapian at tartarus.org
Sat Mar 12 18:08:44 GMT 2016


On Sat, Mar 12, 2016 at 04:27:55PM +0530, Richhiey Thomas wrote:

> Below I write a raw version of my proposal for Clustering of Search Results
> based on our previous mails.

Hi, Richhiey. Thanks for putting this together ahead of the formal
start of applications, and sharing it with us -- and it's really not
too long! Project proposals for something that will last the summer
are naturally going to be long and get into a lot of detail, so don't
shy away from that.

Comments inline (and I've chopped out bits where I don't have anything
useful to add, but it should be obvious from context what I'm
discussing).

Do please note that before submitting this you'll need to put some
work in to make it match our template
(https://trac.xapian.org/wiki/GSoCApplicationTemplate). For instance,
you haven't explicitly answered the 'who will benefit' question; you
hint at it a little in your motivation section, but being as clear as
possible about that question often helps when considering other
details (such as 'how fast is fast enough', which is a key issue with
clustering).

> The algorithm for the given approach will be as follows:
[snip]
> The link to the paper I referred to for proposing this approach is:
> http://ictactjournals.in/paper/IJSCPaper_206-208.pdf

I'd like to ask a few questions about the paper, which hopefully you
can answer, because I'm a little confused by it in places.

For instance, in equation (3) it looks to me like a measure with two
free variables (i and j) is being defined in terms of two free
variables (j and m) and one bound one (i). Is the summation actually
supposed to be over m? (Then within the summation x_i is a
d_w-dimensional vector, as is c_j, ie they're both feature vectors, one
representing a single item and one representing a cluster centroid.)
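Concretely, if the sum is over m then I'd expect (3) to reduce to the
per-pair Euclidean distance, something like this (though that's purely
my guess at the authors' intent):

```latex
d(x_i, c_j) = \sqrt{\sum_{m=1}^{d_w} \left( x_{im} - c_{jm} \right)^2}
```

with i and j free (item and centroid respectively) and m bound by the
sum over the d_w dimensions.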

I may be misreading something fundamentally here, because aspects of
the combined algorithm just don't make sense to me. It doesn't help my
understanding that the paper seems to use overlapping references for
the particle positions and clusters (eg in (3), (4) and I think (5)),
and the cluster centres that are being used as particle data (at the
bottom of p207). I think rewriting it with different variables might
make this a lot clearer. (That may be the main source of my confusion,
in fact!)

As I understand it, though, there are still some parameters to this
algorithm: the maximum iterations, the inertia weight and two
'gravity' constants for the velocity calculation in (1), and K. Is
that correct? Is there anything we can do to predict a useful value
for any of them?
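(To make that parameter list concrete: the velocity update in (1) looks
to me like the standard PSO form, which for one dimension would be
something like the sketch below. The names `w`, `c1`, `c2` are mine,
not the paper's, and this is only illustrative:)

```cpp
#include <cstdlib>

// One dimension of the standard PSO velocity update, matching (1) as I
// read it:
//     v' = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
// w (the inertia weight) and c1, c2 (the two 'gravity'/acceleration
// constants) are the free parameters discussed above; r1 and r2 are
// fresh uniform random numbers in [0, 1) each update.
double updated_velocity(double v, double x, double pbest, double gbest,
                        double w, double c1, double c2) {
    double r1 = std::rand() / (RAND_MAX + 1.0);
    double r2 = std::rand() / (RAND_MAX + 1.0);
    return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x);
}
```

So picking w, c1, c2, K and the iteration cap up front (or knowing how
sensitive the results are to them) seems worth pinning down early.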

> For implementing this, we will find the TermLists for every document and
> then pass this in a TermListGroup along with the docids to a function for
> the initialization step of the algorithm. In this step, we can create a
> vector of K-means centroids and every particle can contain a vector of
> cluster centers used by PSO.

You mean that you're using the term lists of each document to generate
the particle's position vector in a term-based feature space? (The
initial centroids for K means and the particle cluster centers are all
randomly initialised, correct?)

(That's what I'm expecting, I just want to be sure I'm understanding
both the paper and what you've written well enough!)
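(For concreteness, here's roughly the mapping I have in mind, with
entirely made-up names -- each document's term list becomes a vector
over a fixed vocabulary, and that vector space is where the particles
and centroids live:)

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: turn a document's term list (term -> within-
// document frequency) into a feature vector over a fixed vocabulary.
// The vocabulary ordering fixes the dimensions of the PSO/K-means
// space; terms absent from the document contribute 0.
std::vector<double> to_feature_vector(
        const std::map<std::string, int>& termlist,
        const std::vector<std::string>& vocabulary) {
    std::vector<double> features(vocabulary.size(), 0.0);
    for (size_t m = 0; m < vocabulary.size(); ++m) {
        std::map<std::string, int>::const_iterator it =
                termlist.find(vocabulary[m]);
        if (it != termlist.end())
            features[m] = it->second;  // raw tf; tf-idf is another option
    }
    return features;
}
```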

> Project Timeline:

Please don't be put off by the comments I'm making about the timeline
-- writing a plan like this that stretches over several months is
incredibly hard, even if you've done lots of them before. You've done
a good job of starting to break the problem down and think about how
you're going to manage that over the summer, but I think if you break
it down even further you'll end up with a more useful timeline that
lets you worry only about the work you're doing right now rather than
having to think about where you are overall as well.

> April 22 - May 22 (Before coding period starts) : Finish any more research
> required by talking to the mentors and any other resources. Have a clear
> idea of how to go about the project before the coding period

It would be good to identify if there is specific work required during
this period. What further research do you think is required? Is there
anything else you need beyond the cited paper in order to implement
their algorithm?

> May 23 - June 6 (Official coding period starts now) :
> Define all the classes and functions that will be necessary for the
> implementation of the project

Why is this going to take two weeks? It sounds from your discussion as
if you have a good grasp of how to map this into classes
already. Remember that we really recommend having timeline sections
of no more than a week. The more you force yourself to break it down,
the more planning you have to do up front, which generally results in
a smoother project!

> June 7 - June 20 :
> Implement and test the first phase which is a part dealing with the PSO
> algorithm

Is there a way you can break this into smaller pieces? For instance,
are there datasets for PSO out there which you could use to build test
cases, ahead of implementing the algorithm itself? If not, you'll have
to generate your own, calculating the outcome for some data sets by
some other method. (You should be able to use a different
implementation to calculate the test data, because most
implementations won't have license conditions which restrict the
subsequent distribution of output of the implementation.)

With step 6 of the algorithm, it looks to me like the K-means part
that is interwoven with PSO can affect the PSO part, by modifying
pgd. Does it make sense to tackle things in this way (PSO then
K-means), or would it make more sense to break it down as steps of the
algorithm, with each calculation tested separately? (It's probably
easier to write the tests that way.)
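(As an example of what "tested separately" might look like: the K-means
assignment step is a small pure function on its own, easy to unit test
before any of the PSO machinery exists. Names here are mine:)

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between a point and a centroid.
double squared_distance(const std::vector<double>& x,
                        const std::vector<double>& c) {
    double d = 0.0;
    for (size_t m = 0; m < x.size(); ++m) {
        double diff = x[m] - c[m];
        d += diff * diff;
    }
    return d;
}

// K-means assignment step: index of the nearest centroid to x.
size_t nearest_centroid(const std::vector<double>& x,
                        const std::vector<std::vector<double> >& centroids) {
    size_t best = 0;
    for (size_t j = 1; j < centroids.size(); ++j) {
        if (squared_distance(x, centroids[j]) <
            squared_distance(x, centroids[best]))
            best = j;
    }
    return best;
}
```

The centroid-update step and each PSO calculation could get the same
treatment, each with its own small test cases.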

> June 21 - June 27 :
> Making further changes to the code to improve functionality, exception
> handling and bug removal

Exception handling is really something you should do as you go
along. I'm also not clear what 'improve functionality' means in this
context. Are you concerned that the previous two weeks won't be enough
time to complete the PSO algorithm? If so, you definitely want to
break it down into smaller pieces; three weeks is generally too long
for a single stretch of work (because it's difficult to tell two weeks
in if it's going well or not).

> June 28 - July 10 :
> Implement and test the second phase which deals with the K-means being
> actually used for clustering.

Again, hopefully you can break this down further.

> July 11 - July 16 :
> Making further changes to the code to improve functionality, exception
> handling and bug removal

And again, breaking down the previous block should mean you don't need
a 'catch up' period. Don't worry if your estimates go up when you
break something down further (so something that you thought would take
about two weeks, when broken into smaller pieces, might look like
three or even four weeks).

> July 17 - July 28 :
> Implement or use a concrete measure to test the performance of the newly
> implemented clustering algorithm against the previously implemented one in
> terms of speed and accuracy

Do you have an idea of what measure to use for accuracy? The paper you
cited uses between and within cluster measures, but there may be other
techniques. (Also, I don't know if you have access to any
pre-clustered test data, which would allow you to calculate a
precision rate.)
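(To illustrate the kind of measure I mean if you do get labelled data:
cluster 'purity' is one common option -- not something the paper uses,
just an example:)

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Cluster purity: for each cluster, count its most common true label,
// then divide the total of those majority counts by the number of
// items. 1.0 means every cluster is label-homogeneous.
double purity(const std::vector<int>& cluster_of,  // cluster id per item
              const std::vector<int>& label_of) {  // true label per item
    std::map<int, std::map<int, int> > counts;     // cluster -> label -> n
    for (size_t i = 0; i < cluster_of.size(); ++i)
        ++counts[cluster_of[i]][label_of[i]];
    int majority_total = 0;
    for (std::map<int, std::map<int, int> >::const_iterator c = counts.begin();
         c != counts.end(); ++c) {
        int best = 0;
        for (std::map<int, int>::const_iterator l = c->second.begin();
             l != c->second.end(); ++l)
            best = std::max(best, l->second);
        majority_total += best;
    }
    return double(majority_total) / cluster_of.size();
}
```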

It's also going to be important to carefully choose the test data you
use for this. Is there anything in the TREC collections that would be
suitable?

> July 28 - August 1:
> Find out drawbacks that may come out while testing the proposed solution
> and tackle it during this time by changing some parts of the code that may
> increase speed or efficiency.

It's good to have a period to incorporate learning from the evaluation
phase. Is a week enough? (This would be one area where I might
consider having a longer period, and to be prepared to plan that while
or just after conducting the evaluation, based on what you learn.)

> August 2 - August 16:
> Cleaning the written code, improving the documentation or writing
> documentation afresh, identifying any bugs that may come out and writing
> test before the final submissions

GSoC has a slightly unfortunate habit of suggesting in their own
timeline that you work on documentation and tests at the end. We
strongly recommend that you work on them as you are writing the code
(with tests in particular, it's much easier to be confident with where
you are if you have tests as you go, but it's also easier to write
documentation for something you're working on right now, rather than
something you coded a month or two previously).

We generally recommend that you include some 'stretch' goals in your
project, for if everything goes particularly well and you have more
time at the end. So it would be good to have a think about what else
could fit within this project. But don't worry if you can't think of
any; I think with the evaluation phase and then working based on what
you learn from there that you should have enough to do!

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org
