[Xapian-devel] Document clustering module?
Olly Betts
olly at survex.com
Thu Sep 20 20:23:30 BST 2007
On Thu, Sep 20, 2007 at 11:12:55PM +0800, Yung-chung Lin wrote:
> +static bool test_docsim1() {
> + Xapian::Database db(get_database("etext"));
> + string query = "Time";
> + Xapian::Stem stemmer("en");
> +
> + Xapian::QueryParser qp;
> + qp.set_database(db);
> + qp.set_stemmer(stemmer);
> +
> + Xapian::Enquire enq(db);
> + enq.set_query(qp.parse_query(query));
It doesn't really matter, but we aren't trying to test the QueryParser
here, so I'd just go for the simpler and slightly faster option:
enq.set_query(Xapian::Query(stemmer("time")));
> + Xapian::MSet matches = enq.get_mset(0, 30);
> +
> + Xapian::DocSimCosine doc_sim;
> + doc_sim.set_database(db);
> + for (Xapian::doccount i = 0; i < matches.size(); i+=2) {
> + double sim
> + = doc_sim.calculate_similarity(matches[i].get_document(),
> + matches[i+1].get_document());
> + TEST(sim >= 0 && sim <= 1);
All the documents under consideration contain the query term, so we
should have sim > 0 for DocSimCosine.
Incidentally, what properties do we actually require of similarity
measures? We should document these so people creating their own
subclasses know what is required.
I think we definitely require:
* sim(a,b) == sim(b,a) (symmetric)
* sim(a,b) >= 0
We may also want:
* sim(a,b) <= 1
* sim(a,a) == 1
Are there any others?
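For instance, the testcase could check the required ones directly for
each pair it looks at - an untested sketch, assuming the testsuite's
TEST_EQUAL_DOUBLE macro is the right tool for the symmetry check:

    Xapian::Document a = matches[i].get_document();
    Xapian::Document b = matches[i + 1].get_document();
    double ab = doc_sim.calculate_similarity(a, b);
    double ba = doc_sim.calculate_similarity(b, a);
    // Required properties: symmetry and non-negativity.
    TEST_EQUAL_DOUBLE(ab, ba);
    TEST(ab >= 0);
    // Optional extras: bounded above by 1, and self-similarity of 1.
    TEST(ab <= 1);
    TEST_EQUAL_DOUBLE(doc_sim.calculate_similarity(a, a), 1.0);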
> + protected:
> + /// The database from which documents are retrieved.
> + Database db;
I'm not sure if this should be in the base class or not. It's certainly
possible to write a subclass which doesn't need it (e.g. count the
number of terms in common and divide by the greater number of terms)
although many will need it. It's not much of an overhead though.
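Just to sketch what I mean, such a measure only needs the two documents'
termlists - the class and base names below are made up, and only the
calculate_similarity() signature from your testcase is assumed:

    class DocSimTermOverlap : public Xapian::DocSim { // base name assumed
      public:
        double calculate_similarity(const Xapian::Document & a,
                                    const Xapian::Document & b) {
            Xapian::TermIterator i = a.termlist_begin();
            Xapian::TermIterator j = b.termlist_begin();
            Xapian::termcount common = 0, n_a = 0, n_b = 0;
            // Walk both (sorted) termlists in parallel.
            while (i != a.termlist_end() && j != b.termlist_end()) {
                if (*i == *j) {
                    ++common; ++n_a; ++n_b; ++i; ++j;
                } else if (*i < *j) {
                    ++n_a; ++i;
                } else {
                    ++n_b; ++j;
                }
            }
            while (i != a.termlist_end()) { ++n_a; ++i; }
            while (j != b.termlist_end()) { ++n_b; ++j; }
            // Terms in common divided by the greater number of terms.
            Xapian::termcount larger = (n_a > n_b) ? n_a : n_b;
            return larger ? double(common) / larger : 0.0;
        }
    };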
> +#include <math.h>
We recently decided to prefer `#include <cmath>' in new code (there are
semantic differences in some instances, so we're holding off a wholesale
conversion until 1.1.0 to avoid any risk of breaking existing code).
> + for (wt_iter = wt_a.begin(); wt_iter != wt_a.end(); ++wt_iter) {
> + wt_iter->second /= wt_a_denom;
> + }
> +
> + for (wt_iter = wt_b.begin(); wt_iter != wt_b.end(); ++wt_iter) {
> + wt_iter->second /= wt_b_denom;
> + }
Unless I'm missing something, we don't need to do this normalisation at
all, since wt_a_denom and wt_b_denom just factor out of wt_sq_sum_a,
wt_sq_sum_b, and inner_product and then cancel in the final calculation!
Are you doing it like this because you're worried about overflow or
something like that?
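(To spell it out: the final value is

    inner_product / (sqrt(wt_sq_sum_a) * sqrt(wt_sq_sum_b))

and dividing every weight in wt_a by wt_a_denom scales both
inner_product and sqrt(wt_sq_sum_a) by the same factor of
1/wt_a_denom, so it cancels; similarly for wt_b.)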
If we don't need to normalise here, it seems we don't need wt_a or wt_b
or wt_a_denom or wt_b_denom at all - we can just run through the
termlists of the two documents in parallel, advancing whichever
currently has the lowest term, and calculating the three sums at
once.
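Something like this untested sketch, with a and b being the two
Xapian::Document objects and get_weight() standing in for however the
per-term weight gets computed (it's just a placeholder name):

    Xapian::TermIterator i = a.termlist_begin(), j = b.termlist_begin();
    double inner_product = 0, wt_sq_sum_a = 0, wt_sq_sum_b = 0;
    while (i != a.termlist_end() || j != b.termlist_end()) {
        if (j == b.termlist_end() ||
            (i != a.termlist_end() && *i < *j)) {
            // Term only in document a.
            double wt = get_weight(i);
            wt_sq_sum_a += wt * wt;
            ++i;
        } else if (i == a.termlist_end() || *j < *i) {
            // Term only in document b.
            double wt = get_weight(j);
            wt_sq_sum_b += wt * wt;
            ++j;
        } else {
            // Term in both documents.
            double wt_a = get_weight(i), wt_b = get_weight(j);
            wt_sq_sum_a += wt_a * wt_a;
            wt_sq_sum_b += wt_b * wt_b;
            inner_product += wt_a * wt_b;
            ++i; ++j;
        }
    }
    double denom = sqrt(wt_sq_sum_a) * sqrt(wt_sq_sum_b);
    return denom > 0 ? inner_product / denom : 0.0;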
If you don't follow, don't worry - I can try rewriting it before I apply
it and see if it works.
Cheers,
Olly