Introduction and Doubts

James Aylett james-xapian at
Thu Mar 17 12:52:47 GMT 2016

On Tue, Mar 15, 2016 at 11:36:25PM +0530, nirmal singhania wrote:

> we will use clustering results when the user clicks on similar results.

You might also cluster results in order to display them in the first
place. Think about the way Google News displays things: it clusters by
story, then shows a representative few articles from within each cluster.

> The Modules can include
> 1)Cluster(DocVector[] &V) --Returning Clusters(Hashmap with key as cluster
> no and values as Documents which belong to that cluster)

Is there a reason for a cluster to have a number? I'd have thought a
cluster is 'just' (or at least mostly) a vector of Document objects.
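
As a rough sketch of that idea, a cluster can be modelled as a plain
vector of documents, with the set of clusters being a vector of those.
The names here (Doc, Cluster, ClusterSet, representatives) are made up
for illustration and are not Xapian API; Doc is only a stand-in for
Xapian::Document.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Xapian::Document: just an id and a title.
struct Doc {
    unsigned id;
    std::string title;
};

// A cluster is just the documents it contains; no cluster number is stored.
using Cluster = std::vector<Doc>;

// The clustering result is an ordered collection of clusters. If an index
// is ever needed, a cluster's position in this vector can serve as one.
using ClusterSet = std::vector<Cluster>;

// Take up to n documents from the front of each cluster, in cluster order,
// in the spirit of showing a few representative articles per story.
std::vector<Doc> representatives(const ClusterSet& clusters, std::size_t n) {
    std::vector<Doc> out;
    for (const Cluster& c : clusters) {
        for (std::size_t i = 0; i < c.size() && i < n; ++i) {
            out.push_back(c[i]);
        }
    }
    return out;
}
```

With this shape a hashmap keyed on cluster number is not required; the
position in ClusterSet already identifies each cluster.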

> 2)DocVector(Mset &M)-Return array of Tf-idf vectors from search result
> documents with each document having corresponding vector
> 3)EuclideanSim(DocVector &V1,DocVector &V2) Returns Similarity between two
> document vectors

In an earlier implementation of clustering, the similarity method
worked with a group of TermIterators (start and end of each
document). Is your approach better (more efficient, or enabling more
features) than that?
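
To make the comparison concrete, here is a minimal sketch of a
similarity function in the iterator-pair style. It is not the earlier
implementation itself. It assumes each document is presented as a
begin/end range of (term, weight) entries sorted by term, which is how
this sketch models a pair of term iterators; TermWeight and
cosine_similarity are hypothetical names, and cosine is used only as an
example measure.

```cpp
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// Hypothetical term entry: a term and its weight (for example tf-idf).
using TermWeight = std::pair<std::string, double>;

// Cosine similarity between two documents, each given as a [begin, end)
// range of TermWeight entries sorted by term. Both ranges are walked once,
// merge-style, so no intermediate document vector is built.
template <typename It>
double cosine_similarity(It a, It a_end, It b, It b_end) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    while (a != a_end && b != b_end) {
        if (a->first < b->first) {
            norm_a += a->second * a->second;  // term only in document A
            ++a;
        } else if (b->first < a->first) {
            norm_b += b->second * b->second;  // term only in document B
            ++b;
        } else {
            dot += a->second * b->second;     // term shared by both
            norm_a += a->second * a->second;
            norm_b += b->second * b->second;
            ++a;
            ++b;
        }
    }
    // Whatever remains in either range contributes only to that norm.
    for (; a != a_end; ++a) norm_a += a->second * a->second;
    for (; b != b_end; ++b) norm_b += b->second * b->second;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

The trade-off to weigh is that this style streams over the terms with no
extra storage, while a precomputed DocVector costs memory but avoids
re-reading the term lists when a document is compared many times.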

> 4)KeywordExtract(Cluster &C1) (Returning String keyword and assign it to
> cluster)
> Putting that keyword as title and all the documents in cluster returned by
> clicking it

This feels like it could be driven by a method on Cluster. What would
be the advantage of introducing another class here?
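
For illustration only, this is roughly what a method on the cluster
could look like. WeightedDoc, LabelledCluster and label() are
hypothetical names, and picking the term with the largest summed weight
is just one simple assumption about how a keyword might be chosen.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical document: an id plus per-term weights (for example tf-idf).
struct WeightedDoc {
    unsigned id;
    std::map<std::string, double> weights;
};

// Hypothetical cluster class that owns its documents and can label itself.
class LabelledCluster {
public:
    void add(WeightedDoc d) { docs_.push_back(std::move(d)); }
    const std::vector<WeightedDoc>& documents() const { return docs_; }

    // Returns the term with the largest summed weight across the cluster,
    // or an empty string for an empty cluster. On a tie, the term that
    // sorts first wins, because std::map iterates in key order.
    std::string label() const {
        std::map<std::string, double> totals;
        for (const WeightedDoc& d : docs_) {
            for (const auto& tw : d.weights) totals[tw.first] += tw.second;
        }
        std::string best;
        double best_weight = 0.0;
        bool found = false;
        for (const auto& tw : totals) {
            if (!found || tw.second > best_weight) {
                best = tw.first;
                best_weight = tw.second;
                found = true;
            }
        }
        return best;
    }

private:
    std::vector<WeightedDoc> docs_;
};
```

Here the title shown for a cluster is simply cluster.label(), with no
separate KeywordExtract class involved.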

> 5)GetSimilar(Document &D) Returning Ranked Similar Documents based on
> Clustering

I'm not sure how this can work with the signature you've given. You
presumably need to have done the clustering beforehand (which requires
a query to have been run, so probably an MSet), but you aren't feeding
in the clusters.

I'm also not entirely sure what kind of interface this is
supporting. If you created an RSet from this document and re-ran the
query with that, you'd end up with a tighter set of matching documents
to cluster. Do we need a separate module to achieve this?

> One Suggestion i want is whether to use cosine similarity first or
> euclidean similarity as there is not clear cut explanation which is better.
> Based on your experience you might be the best person to guide through it.

I don't have suitable experience. Does the literature not provide a
clear recommendation?
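
One fact that bears on the choice, independent of any recommendation:
for vectors normalised to unit length, the squared Euclidean distance
equals 2 * (1 - cosine similarity), so the two measures order pairs of
documents identically. The sketch below checks that numerically. It
assumes dense vectors of equal length, and the function names are
illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product of two dense vectors, assumed to have the same length.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Returns v scaled to unit length; a zero vector is returned unchanged.
std::vector<double> normalised(const std::vector<double>& v) {
    double norm = std::sqrt(dot(v, v));
    if (norm == 0.0) return v;
    std::vector<double> out(v);
    for (double& x : out) x /= norm;
    return out;
}

// Cosine similarity; returns 0 if either vector is all zeros.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    double denom = std::sqrt(dot(a, a)) * std::sqrt(dot(b, b));
    return denom == 0.0 ? 0.0 : dot(a, b) / denom;
}

// Squared Euclidean distance between two vectors of the same length.
double squared_euclidean(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}
```

For a = (3, 4) and b = (4, 3), squared_euclidean(normalised(a),
normalised(b)) and 2 * (1 - cosine(a, b)) agree up to rounding. The
measures only behave differently when document length is left in the
vectors, so the real decision is whether to normalise.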

> please tell how much details i have to add to each part about
> implementation and methodology

The main thing you need is to break the project down into sufficiently
small pieces that you can create a good timeline. Our guidance notes
provide some ideas of what we're looking for here.


  James Aylett, occasional trouble-maker

More information about the Xapian-devel mailing list