[Xapian-discuss] Text extractor

James Aylett james-xapian at tartarus.org
Wed Aug 10 10:16:24 BST 2005


On Tue, Aug 09, 2005 at 10:29:44PM +0200, Sebastjan Trepca wrote:

> I was wondering if you have any tips about extracting postings/terms
> from an article. Right now I have this lame extractor which just
> splits the article on spaces into terms and adds them to a
> document, but of course terms like "blah," can be problematic.
> Well, if that's even the right way to do this :) 

If you download Omega, the omindex.cc source file contains a term
generation algorithm that works well with the QueryParser built into
Xapian. It's written in C++, but should be reasonably readable and could
be converted to Python without too much difficulty, I'd hope.
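
To give a rough idea of the shape in Python, here is a minimal sketch. It
is not a port of omindex.cc: the word pattern, the lowercasing and the
name extract_terms are my own assumptions for illustration, and it does no
stemming. It only shows how to avoid terms like "blah," by matching words
rather than splitting on spaces.

```python
import re

# Runs of letters/digits, optionally joined by internal apostrophes.
# This pattern is an assumption for illustration, not Omega's rule set.
WORD_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def extract_terms(text):
    """Return (term, position) pairs: lowercased words, 1-based positions."""
    return [
        (match.group(0).lower(), position)
        for position, match in enumerate(WORD_RE.finditer(text), start=1)
    ]


if __name__ == "__main__":
    # Punctuation around a word no longer ends up inside the term.
    for term, position in extract_terms('Well, "blah," said the article.'):
        print(position, term)
```

Each pair could then be handed to something like Document.add_posting(term,
position) in the Xapian bindings. If you search with a stemming QueryParser
you would probably need to stem at index time as well, which this sketch
leaves out.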

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


