[Xapian-devel] [GSoC 2014] clustering of search results

Fri Mar 21 16:09:02 GMT 2014

Hello all,

I have submitted the final proposal for this project.
Here's the link:

https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/sam_1993/5707702298738688?validated=True#

I would like to hear what's missing in this, so that I can do the necessary
changes. It would really be a great help.

Thanks,
Saksham

On Tue, Mar 11, 2014 at 6:31 PM, Olly Betts <olly at survex.com> wrote:

> On Mon, Mar 10, 2014 at 07:37:30PM +0530, Saksham Maheshwari wrote:
> > On Mon, Mar 10, 2014 at 3:59 PM, Olly Betts <olly at survex.com> wrote:
> >
> > > Exactly what approach the project takes isn't nailed down - it just
> > > seemed something which would be interesting for a student to work on,
> > > and would be useful to Xapian users.
> > >
> > > My understanding of the current clustering branch (which may not be
> > > completely accurate) is that it clusters based on a pair-wise measure
> > > of document similarity, and that the user can specify which terms from
> > > the documents are used.  I think you'd consider more than just the
> words
> > > in the query - in a typical case, the query is short and the top N
> > > documents will match all the words in it.
> >
> > So, what you are saying is that we need to
> >  1. Assign similar objects to the same subset
> >  2. Assign dissimilar objects to different subsets
> > that is, we are trying to make disjoint subsets.
>
> I think so.  While there are clustering techniques where an item can be
> in more than one cluster, or in which items have a degree of belonging
> to each cluster, I'd say a strict partition into disjoint subsets is
> likely to be most useful for presenting search results in easier to
> comprehend ways.
>
> > > It's an open question whether the project should be based on the
> > > existing code or not, but I think it should at least attempt to learn
> > > from the existing code - it would be a real shame to spend 3 months
> > > working on this only to end up with two different clustering
> > > implementations, neither of which is usable on larger sets of
> documents.
> > >
> > > I think the clustering would probably be based on the terms in the
> > > documents (I can't really think what else it would be based on).
> > > Possibly using Xapian's query expansion feature (Enquire::get_eset())
> to
> > > generate a more restricted list of "interesting" terms to consider
> would
> > > help.
> >
> > Yes, what I know is that clustering will be based on number of clusters
> > i.e. no. of disjoints sets for that particular document.
>
> I'm not sure I follow what you're saying here.  Isn't each particular
> document going to be in exactly one set?
>
> > Enquire:::get_eset() will return the expand set of related documents. The
> > corresponding "Xapian::ExpandDecider * edecider " will decide which
> > document has to be inserted in the expand set.
>
> No, get_eset() returns a set of *TERMS*, not documents.  It works
> based on a set of documents marked as relevant (an RSet).
>
> And an ExpandDecider decides which *TERMS* to allow into the expand set.
>
> What I was suggesting is that you could potentially generate an ESet and
> use that as a restricted list of terms to consider.  But generating an
> ESet does takes a little time in itself.  Perhaps it would be better to
> just pick the terms with the highest "discriminating power" (really
> common terms aren't interesting, but neither are the really rare ones -
> at the extremes, a term is no help to clustering if it is in all
> documents or if it is in only one).
>
> > So, here indirectly you are talking about is the vector space model where
> > we will measure the relatedness between the sets on the basis of their
> > Euclidean distances. Thus, using k-means algorithm ?
>
> It was just an example which I hoped would help you visualise what I was
> talking about.  I'm not trying to say the clustering should be done
> based on a vector space model.
>
> Using k-means clustering is an option, though it has the drawback that
> you need to decide how many clusters you want first, but there might be
> a different number of natural clusters in the data.  It would be more
> satisfactory to be able to say you want (say) "about 7 clusters", or
> "between 6 and 10 clusters".  Running the k-means clustering multiple
> times to try different numbers of clusters probably isn't feasible for
> our applications.
>
> > Can you please tell me how should I proceed?  What should I do to
> > start with the project ?
>
> If you've not already done so, the first step is really to check out the
> Xapian code and get it to build.
>
> If/once you've done that, I'd suggest looking through the existing
> clustering code and working out what it is doing.  If you're familiar
> with profiling tools, then you could even try profiling it to see
> where the time is spent and what might be causing it to scale poorly.
>
> > PS: Is the project-related to speed up the existing code, that is, to
> make
> > it work faster or something else,
>
> The important goal is to have a clustering feature which works at a
> usable speed.
>
> That could mean taking the existing clustering code and making it work
> faster, or it could mean replacing it with new code.  If we start
> again, I think it would be prudent to at least try to understand why
> the current code doesn't scale as we'd like.  If we don't have some
> idea why the previous attempt didn't work out, we risk repeating the
> same mistakes.
>
> > along with some good merging algorithm to merge the results.
>
> I'm confused - merging what results?
>
> Cheers,
>     Olly
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140321/f078f56a/attachment.html>