[Xapian-discuss] Extract common phrases from index

Olly Betts olly at survex.com
Thu Nov 6 13:38:34 GMT 2008


On 06/11/2008, Josh <leftdrive at gmail.com> wrote:
> Is it possible to extract common phrases from an index?
>
> Basically, I'd like to index my document set and find words that
> commonly appear next to each other.
>
> For example, if I have a set of recent political news articles I may expect
> to find "John McCain", "Sarah Palin" and "Barack Obama".
>
> Ideally I'd like to specify any number of words (all 2 word phrases,
> all 3 word phrases).
>
> Possible? Crazy? Point me in the right direction.

I don't see how to do it efficiently if you just index "normally" - the database
doesn't store the positional information in a way which would make this
especially easy.

However, you could index word n-grams as terms, and then look at the
most frequent of these terms.  So you'd have an n-gram term for "Sarah
Palin" - say:

XNsarah palin
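
As a rough illustration of generating such terms, here is a minimal Python
sketch. The function name ngram_terms, the lowercasing and the simple
regex tokeniser are assumptions made for the example; only the "XN" prefix
and the "XNsarah palin" term shape come from the message above.

```python
import re

def ngram_terms(text, n=2, prefix="XN"):
    """Return prefixed word n-gram terms for text, e.g. 'XNsarah palin'.

    Hypothetical helper: tokenisation is a plain regex split on word
    characters, and words are lowercased before joining.
    """
    words = re.findall(r"\w+", text.lower())
    # Slide a window of n words across the text; empty if text is too short.
    return [prefix + " ".join(words[i:i + n])
            for i in range(len(words) - n + 1)]

terms = ngram_terms("Sarah Palin spoke", n=2)
```

Each resulting string could then be added to the document as an extra term
at index time (for instance with Document.add_term in the Xapian bindings),
so that the term frequencies give the phrase counts.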

If you want to pull out names then you can cut the number of word n-grams
down a lot by only generating them for capitalised words (EuroFerret used a
trick like this - instead of storing phrase data, it just indexed the 12 most
common 2-gram word pairs for each document as extra terms).

Or perhaps require that the first and last word are capitalised, so that 3+ word
phrases with prepositions are handled (e.g. "City of London").
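
The first-and-last-word rule might look something like the sketch below.
The name capitalised_phrases and the max_n parameter are made up for the
example, and it makes no attempt to handle sentence boundaries, so a
capitalised sentence-initial word will produce some noise.

```python
import re

def capitalised_phrases(text, max_n=3):
    """Return 2..max_n word phrases whose first and last words are capitalised.

    Hypothetical helper: inner words may be lowercase, which lets phrases
    such as "City of London" through.
    """
    words = re.findall(r"\w+", text)
    phrases = []
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Only the outer words are checked for an initial capital.
            if gram[0][0].isupper() and gram[-1][0].isupper():
                phrases.append(" ".join(gram))
    return phrases

found = capitalised_phrases("Sarah Palin visited the City of London")
```

Counting how often each candidate occurs across the document set should
push the spurious matches well down the list.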

But if you're doing this, you could just use a hash table to do the counting - I
don't see a particular advantage to doing it with Xapian.
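
For the hash table route, a minimal sketch in Python using
collections.Counter (a dict subclass) could be as follows. The function name
common_phrases, the lowercasing and the regex tokeniser are assumptions for
the example, and it counts total occurrences rather than the number of
documents containing each phrase.

```python
import re
from collections import Counter

def common_phrases(docs, n=2, top=10):
    """Count word n-grams across docs and return the `top` most frequent.

    Hypothetical helper: docs is any iterable of strings.
    """
    counts = Counter()
    for text in docs:
        words = re.findall(r"\w+", text.lower())
        # Add every n-word window from this document to the running tally.
        counts.update(" ".join(words[i:i + n])
                      for i in range(len(words) - n + 1))
    return counts.most_common(top)

docs = ["Sarah Palin spoke", "John McCain and Sarah Palin"]
best = common_phrases(docs, n=2, top=1)
```

Running it once per value of n gives the 2-word phrases, 3-word phrases and
so on that the original question asked for.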

Cheers,
    Olly


