[Xapian-discuss] Text extractor
James Aylett
james-xapian at tartarus.org
Wed Aug 10 10:16:24 BST 2005
On Tue, Aug 09, 2005 at 10:29:44PM +0200, Sebastjan Trepca wrote:
> I was wondering if you have any tips about extracting postings/terms
> from an article. Right now I have this lame extractor which just
> splits the article with a space into terms and adds them to a
> document, but of course terms like "blah," can be problematic.
> Well, if that's even the right way to do this :)
If you download Omega, the omindex.cc source file contains a term
generation algorithm that works well with the QueryParser built into
Xapian. It's written in C++, but should be reasonably readable and could
be converted to Python without too much difficulty, I'd hope.
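
In the meantime, here is a rough Python sketch of the general idea. It is
not a port of omindex.cc and does not reproduce its rules. The function
name extract_terms and the definition of a word (a run of letters or
digits, optionally joined by apostrophes) are my own assumptions for
illustration. It lowercases each word and drops surrounding punctuation,
so "blah," and "blah" end up as the same term.

```python
import re

# Assumption: a word is a run of letters/digits, optionally with internal
# apostrophes (e.g. "don't"). These are NOT omindex.cc's actual rules.
WORD_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def extract_terms(text):
    """Return a list of (term, position) pairs for the words in text.

    Terms are lowercased; positions count words starting from 1.
    Punctuation around words is skipped rather than kept in the term.
    """
    return [
        (match.group(0).lower(), position)
        for position, match in enumerate(WORD_RE.finditer(text), start=1)
    ]


if __name__ == "__main__":
    for term, position in extract_terms('Well, "blah," is no problem now.'):
        print(position, term)
```

With the Python bindings, each pair could then be passed to
Document.add_posting(term, position). This does no stemming, so query
terms would have to be left unstemmed as well for them to match.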
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org