[Xapian-discuss] Term extraction with Xapian

Charlie Hull charlie at juggler.net
Tue Feb 14 15:44:45 GMT 2006

Olly Betts wrote:
>> Let's say I have a raw text of 300 words. I want to extract terms
>> (nouns/phrases) like "ipod nano", "sony z1", "tom cruise", etc.
>> I wonder how I could do that with Xapian (which provides really good
>> performance!) using its termlist and maybe some fuzzy logic operators?
> If you can pull out the noun phrases and add them as terms at index
> time, you can use relevance feedback to do the filtering (via the
> Xapian::Expand class).  There are GPL part-of-speech taggers, but
> I've not tried any of them.  You might be able to get by with some
> heuristics (e.g. capital letters, words containing numbers) to pick
> suitable word pairs.
> Cheers,
>     Olly

Richard wrote a library called AyeAye that does this kind of thing. It
uses various heuristics to extract terms from plain text or HTML.
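
To give a rough idea of the heuristic approach, here is a minimal Python
sketch of the capital-letter / contains-a-number idea Olly mentions above.
It is not AyeAye's code and doesn't use Xapian at all; the names
candidate_pairs and looks_like_name are made up for the illustration, and
the splitting rules are an assumption rather than anything tested on real
data.

```python
import re

# Runs of letters and digits count as words (an assumption for this sketch).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

# Punctuation we treat as a boundary that a word pair must not cross.
BOUNDARY_RE = re.compile(r"[.!?;:,\n]+")


def looks_like_name(word):
    """Heuristic: the word has a capital letter or a digit somewhere in it."""
    return any(ch.isupper() or ch.isdigit() for ch in word)


def candidate_pairs(text):
    """Return adjacent word pairs where both words pass the heuristic.

    Pairs are lowercased so they could be added as terms at index time.
    """
    pairs = []
    for chunk in BOUNDARY_RE.split(text):
        words = TOKEN_RE.findall(chunk)
        for first, second in zip(words, words[1:]):
            if looks_like_name(first) and looks_like_name(second):
                pairs.append(f"{first} {second}".lower())
    return pairs


if __name__ == "__main__":
    sample = "I saw Tom Cruise with an iPod Nano and a Sony Z1 yesterday."
    print(candidate_pairs(sample))
```

Something this simple will pick up false positives (a capitalised word at
the start of a sentence followed by a name, for instance), which is where
the relevance feedback filtering Olly describes would come in.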


