[Xapian-discuss] Term extraction with Xapian
charlie at juggler.net
Tue Feb 14 15:44:45 GMT 2006
Olly Betts wrote:
>> Let's say I have a raw text of 300 words. I want to extract terms
>> (nouns/phrases) like "ipod nano", "sony z1", "tom cruise", etc
>> I wonder how I could do that with Xapian (which provides really good
>> performance!) using its termlist and maybe some fuzzy logic operators?
> If you can pull out the noun phrases and add them as terms at index
> time, you can use relevance feedback to do the filtering (via the
> Xapian::Expand class). There are GPL part-of-speech taggers, but
> I've not tried any of them. You might be able to get by with some
> heuristics (e.g. capital letters, words containing numbers) to pick
> suitable word pairs.
We have a library called AyeAye, written by Richard, that does this kind
of thing. It uses various heuristics to extract terms from plain text or
HTML.
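
For illustration, the heuristics Olly mentions (capital letters, words
containing numbers) could be sketched roughly as below. This is not
AyeAye's code and not part of Xapian; the function names is_candidate
and extract_pairs are made up for this example, and it assumes that a
pair is only worth keeping when both words match the heuristic and no
punctuation sits between them.

```python
import re


def is_candidate(word):
    # Heuristic from the thread: the word contains a capital letter
    # or a digit (e.g. "iPod", "Z1", "Tom").
    return any(ch.isupper() or ch.isdigit() for ch in word)


def extract_pairs(text):
    """Return lowercased adjacent word pairs where both words match.

    Hypothetical helper for illustration only.
    """
    pairs = []
    # Split on punctuation so a pair never spans a clause boundary.
    for chunk in re.split(r"[^\w\s]+", text):
        words = chunk.split()
        for first, second in zip(words, words[1:]):
            if is_candidate(first) and is_candidate(second):
                pair = f"{first} {second}".lower()
                if pair not in pairs:  # keep first occurrence only
                    pairs.append(pair)
    return pairs


sample = "The new iPod Nano beats the Sony Z1, says Tom Cruise."
print(extract_pairs(sample))
# prints ['ipod nano', 'sony z1', 'tom cruise']
```

Each pair could then be added to the document as a term at index time
(for example with Xapian::Document::add_term), so that relevance
feedback can do the filtering as described above. A sentence-initial
capital followed by another capitalised word would also be picked up,
so real text would probably need a stopword check on top of this.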