[Xapian-discuss] Term extraction with Xapian

Olly Betts olly at survex.com
Tue Feb 14 11:14:35 GMT 2006


On Sun, Feb 12, 2006 at 11:42:44AM +0200, David Levy wrote:
> I've been successfully using Xapian/Omega for several months on my website
> to provide product catalog search functionality.
> But now I have a new need and I can't figure out if Xapian can meet it:
> I want to reproduce the term extraction algorithm provided by "Yahoo! Term
> extraction WS" (
> http://developer.yahoo.net/search/content/V1/termExtraction.html), which is
> limited to 5000 queries a day - not enough for me :(.

Interesting, I'd not come across this before.

It looks like it's probably a part-of-speech tagger to identify noun
phrases, followed by some sort of filtering to pick the most relevant
ones.

> Let's say I have a raw text of 300 words. I want to extract terms
> (nouns/phrases) like "ipod nano", "sony z1", "tom cruise", etc
> 
> I wonder how I could do that with Xapian (which provides really good
> performance!) using its termlist and maybe some fuzzy logic operators?

If you can pull out the noun phrases and add them as terms at index
time, you can use relevance feedback to do the filtering (via the
Xapian::Expand class).  There are GPL-licensed part-of-speech taggers,
but I've not tried any of them.  You might be able to get by with some
heuristics (e.g. capital letters, words containing numbers) to pick
suitable word pairs.
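
To make the heuristic idea concrete, here is a rough sketch in plain
Python (no Xapian calls) of picking adjacent word pairs where both words
contain a capital letter or a digit.  The names candidate_pairs and
looks_distinctive are made up for illustration, and the rule itself is
only one guess at what "suitable" might mean.  It will also pick up
capitalised sentence-initial words, so the pairs would still need the
relevance-feedback filtering described above.

```python
import re

# A word is a run of ASCII letters and digits (an assumption; real
# product names may need hyphens, apostrophes or non-ASCII letters).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

# Pairs are not allowed to span punctuation or line breaks.
SEGMENT_RE = re.compile(r"[.!?,;:()\n]+")


def looks_distinctive(word):
    """Heuristic: the word contains a capital letter or a digit."""
    return any(c.isupper() or c.isdigit() for c in word)


def candidate_pairs(text):
    """Return lowercased adjacent word pairs in which both words look
    distinctive, in order of first appearance and without duplicates."""
    pairs = []
    for segment in SEGMENT_RE.split(text):
        words = TOKEN_RE.findall(segment)
        for first, second in zip(words, words[1:]):
            if looks_distinctive(first) and looks_distinctive(second):
                pair = f"{first} {second}".lower()
                if pair not in pairs:
                    pairs.append(pair)
    return pairs


if __name__ == "__main__":
    sample = "I bought an iPod Nano and a Sony Z1 after seeing Tom Cruise."
    for pair in candidate_pairs(sample):
        print(pair)
```

Each pair returned could then be added to the document as an extra term
at index time, alongside the normal single-word terms.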

Cheers,
    Olly


