[Xapian-devel] Document clustering module?

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Thu Sep 20 23:41:24 BST 2007


On 9/21/07, Olly Betts <olly at survex.com> wrote:
> On Thu, Sep 20, 2007 at 11:12:55PM +0800, Yung-chung Lin wrote:
> > +static bool test_docsim1() {
> > +    Xapian::Database db(get_database("etext"));
> > +    string query = "Time";
> > +    Xapian::Stem stemmer("en");
> > +
> > +    Xapian::QueryParser qp;
> > +    qp.set_database(db);
> > +    qp.set_stemmer(stemmer);
> > +
> > +    Xapian::Enquire enq(db);
> > +    enq.set_query(qp.parse_query(query));
>
> It doesn't really matter, but we aren't trying to test the QueryParser
> here so I'd just go for the simpler and slightly faster option:
>
>     enq.set_query(Xapian::Query(stemmer("time")));
>

I'll fix this part.

> > +    Xapian::MSet matches = enq.get_mset(0, 30);
> > +
> > +    Xapian::DocSimCosine doc_sim;
> > +    doc_sim.set_database(db);
> > +    for (Xapian::doccount i = 0; i < matches.size(); i+=2) {
> > +        double sim
> > +            = doc_sim.calculate_similarity(matches[i].get_document(),
> > +                                           matches[i+1].get_document());
> > +        TEST(sim >= 0 && sim <= 1);
>
> All the documents under consideration contain "time", so we should have
> sim > 0 for DocSimCosine.
>
> Incidentally, what properties do we actually require of similarity
> measures?  We should document these so people creating their own
> subclasses know what is required.
>
> I think we definitely require:
>
> * sim(a,b) == sim(b,a) (commutative)
> * sim(a,b) >= 0
>
> We may also want:
>
> * sim(a,b) <= 1
> * sim(a,a) == 1

I forgot to test the properties listed here. I'll add the tests and the
documentation in the next update.

>
> Are there any others?

I am not sure if there are more.

>
> > +  protected:
> > +    /// The database from which documents are retrieved.
> > +    Database db;
>
> I'm not sure if this should be in the base class or not.  It's certainly
> possible to write a subclass which doesn't need it (e.g. count the
> number of terms in common and divide by the greater number of terms)
> although many will need it.  It's not much of an overhead though.
>

I think I will keep it since it is handy and, as you said, not much of
an overhead.

> > +#include <math.h>
>
> We recently decided to prefer `#include <cmath>' in new code (there are
> semantic differences in some instances, so we're holding off a wholesale
> conversion until 1.1.0 to avoid any risk of breaking existing code).
>

I'll convert it to <cmath>.

> > +    for (wt_iter = wt_a.begin(); wt_iter != wt_a.end(); ++wt_iter) {
> > +        wt_iter->second /= wt_a_denom;
> > +    }
> > +
> > +    for (wt_iter = wt_b.begin(); wt_iter != wt_b.end(); ++wt_iter) {
> > +        wt_iter->second /= wt_b_denom;
> > +    }
>
> Unless I'm missing something we don't need to do this normalisation at
> all, since wt_a_denom and wt_b_denom just factor out of wt_sq_sum_a,
> wt_sq_sum_b, and inner_product and then cancel in the final calculation!
>
> Are you doing it like this because you're worried about overflow or
> something like that?

Yes, I was thinking about overflow, although that probably would not
happen in normal cases. I think I will rewrite this part.

>
> If we don't need to normalise here, it seems we don't need wt_a or wt_b
> or wt_a_denom or wt_b_denom at all - we can just run through the
> termlists of the two documents in parallel, advancing whichever
> currently has the lowest term, and calculating the three sums at
> once.
>
> If you don't follow, don't worry - I can try rewriting it before I apply
> it and see if it works.
>
> Cheers,
>     Olly
>

Best,
Yung-chung Lin
