GSOC-2016 Project : Clustering of search results

Mon Mar 7 15:28:33 GMT 2016

On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:

> My questions are:
> 1) Can you direct me on how to convert this raw idea into a proposal in
> context to Xapian with more detail? What areas do I focus on?

Our GSoC guide has an application template
<https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you
should use to structure your proposal. It has some recommendations on
how you should lay out and think about a proposal, particularly in
terms of planning your timeline, which in the past has proven one of
the keys to a successful project.

> 2) It would be great if you could elaborate a little on the performance
> evaluation part that I haven't been able to follow too well.

'Performance' is a tricky word in information retrieval, so I'll break
this into two pieces: speed and quality.

As the project notes say, the previous attempt at implementing
clustering was far too slow to be practically useful. So that's the
speed side: we want to be able to cluster a fairly large number of
documents quickly (which would need some thought -- do we want to be
able to cluster 1000 documents in under a second? 10,000 in a handful
of seconds? or might 1000 documents in a handful of seconds be
sufficient?).
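
To make the speed targets above concrete, here is a minimal timing
sketch. It isn't Xapian code: `placeholder_cluster` and `make_docs` are
hypothetical stand-ins invented for illustration, and you would swap in
the real clustering call and real documents to measure anything
meaningful.

```python
import random
import time


def time_clustering(cluster_fn, docs, repeats=3):
    """Return the best wall-clock time in seconds for cluster_fn(docs)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(docs)
        best = min(best, time.perf_counter() - start)
    return best


def placeholder_cluster(docs, k=10):
    """Hypothetical stand-in: assign each document to a cluster by its length."""
    return [len(doc) % k for doc in docs]


def make_docs(n, seed=0):
    """Generate n fake documents as lists of random term ids (illustrative only)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(5000) for _ in range(rng.randint(20, 200))]
        for _ in range(n)
    ]


if __name__ == "__main__":
    # The sizes mirror the figures discussed above: 1000 and 10,000 documents.
    for n in (1000, 10000):
        elapsed = time_clustering(placeholder_cluster, make_docs(n))
        print(f"{n} documents: {elapsed:.4f}s")
```

Taking the best of a few runs reduces noise from other activity on the
machine; whether best or mean time is the right thing to report is
itself a choice worth stating in a proposal.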

Quality can be judged in a number of ways, but we're generally trying
to produce 'good' clusters of the kind a human with knowledge of the
subject area would create. There's some discussion of how you might evaluate
this in Introduction to Information Retrieval, section 16.3 p356 (or
online at
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).
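
One of the measures covered in that section is purity: assign each
cluster its most common human-assigned class, then count the fraction
of documents that agree with their cluster's class. A minimal sketch,
assuming you have a list of cluster ids and a parallel list of
hand-made labels (the function and variable names here are my own, not
anything from Xapian):

```python
from collections import Counter


def purity(cluster_ids, gold_labels):
    """Fraction of documents matching the majority gold label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must be the same length")
    if not cluster_ids:
        raise ValueError("need at least one document")

    # Count how often each gold label occurs inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its largest gold class.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "a"]
    print(purity(clusters, labels))  # 4 of 6 documents match their cluster's majority
```

Bear in mind that purity is trivially 1.0 if every document gets its
own cluster, so it only makes sense when comparing runs with a similar
number of clusters; the other measures in that section address this.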

It's perhaps worth pointing out that Hearst (2009, p200) suggests that
monothetic clusters (where membership is decided by a single feature)
'may be easier for users to understand', although the book doesn't
cite any specific work to back up this claim. Still, that suggests a
K-means based approach, which groups documents on many features at
once, isn't necessarily going to be the most helpful in all cases;
there may be other approaches worth considering instead. (That entire
section on clustering is worth reading if you have access to the
book.)

Hearst, MA (2009) 'Search User Interfaces', CUP, Cambridge.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org