[Xapian-discuss] double post: terms weight access

Olly Betts olly at survex.com
Fri Sep 10 21:42:57 BST 2004


On Fri, Sep 10, 2004 at 04:16:09PM -0400, Georges Dupret wrote:
> I have been trying hard to find out how to access the indexed documents
> weight vectors without success. I found the query weights and the total
> weights of documents, but not the individual weights. Could somebody
> give me a hint?

I think you're probably looking for the Within Document Frequency (wdf).
You can call get_wdf() on a TermIterator to get the wdf for each
term in a document in turn, or call it on a PostingIterator to get
the wdf for a term in each document it indexes in turn.

> My first objective is to compare documents (find almost duplicated
> documents).

If you're trying to find documents like a given document, running that
document as a query works well.  If you're trying to find groups of
similar documents, you're probably going to need to implement some sort
of clustering.  "Single Link Clustering" is worth considering as it can
be implemented to run in "only" O(n^2) - that's pretty good compared to
other common clustering algorithms.
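
As a rough illustration of the clustering idea, here is a minimal sketch
in plain C++ which doesn't use Xapian at all.  It assumes you've already
pulled each document's terms and wdf values out into a sparse vector (the
TermVector type and the function names are made up for this example), that
cosine similarity is an acceptable measure, and that you cut the clustering
at a fixed similarity threshold.  With a fixed threshold, single link
clustering comes down to finding connected components, which union-find
does in one pass over the O(n^2) pairs:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical representation: one sparse vector per document (term -> wdf).
using TermVector = std::map<std::string, double>;

// Cosine similarity of two sparse vectors; 0 if either one is empty.
double cosine(const TermVector& a, const TermVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& kv : a) {
        na += kv.second * kv.second;
        auto it = b.find(kv.first);
        if (it != b.end()) dot += kv.second * it->second;
    }
    for (const auto& kv : b) nb += kv.second * kv.second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Union-find lookup with path halving.
std::size_t find_root(std::vector<std::size_t>& parent, std::size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Single link clustering cut at `threshold`: two documents share a cluster
// if a chain of pairwise similarities >= threshold connects them.
// Returns one label per document: the lowest document index in its cluster.
std::vector<std::size_t> single_link_clusters(
        const std::vector<TermVector>& docs, double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::vector<std::size_t> parent(docs.size());
    std::iota(parent.begin(), parent.end(), 0);
    for (std::size_t i = 0; i < docs.size(); ++i) {
        for (std::size_t j = i + 1; j < docs.size(); ++j) {
            if (cosine(docs[i], docs[j]) < threshold) continue;
            std::size_t ri = find_root(parent, i);
            std::size_t rj = find_root(parent, j);
            // Keep the smaller index as root so labels are deterministic.
            if (ri < rj) parent[rj] = ri;
            else if (rj < ri) parent[ri] = rj;
        }
    }
    std::vector<std::size_t> labels(docs.size());
    for (std::size_t i = 0; i < docs.size(); ++i)
        labels[i] = find_root(parent, i);
    return labels;
}
```

Note the chaining behaviour this has: documents A and C end up together
if both are close to B, even when A and C aren't close to each other.
For near-duplicate detection you'd want a threshold close to 1.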

Cheers,
    Olly


