[Xapian-discuss] [q] phrase replacement in thousands of text files

Olly Betts olly at survex.com
Mon May 21 06:46:18 BST 2012


On Sun, May 20, 2012 at 10:59:11PM -0400, V S P wrote:
> Obviously couple of problems 3 million times search 17GB worth of text

I'm not sure I see why this is a problem unless the run time of this
is highly sensitive.

> Second -- I do not understand how (if at all possible) to get the
> start/end offset of the found phrase within the source file

Xapian doesn't store the byte offsets (only word offsets), so this isn't
possible.  It can narrow down the number of files you need to go and
look at for each replacement though, which could make quite a difference
if many of the replacements are rarely done.
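
As a rough sketch of that second stage: once a phrase search has given you
a list of candidate files, the replacement itself is ordinary byte-level
string work.  The helper names below (replace_all, replace_in_file) are
made up for illustration and aren't part of Xapian, and how you get from a
matching document to a file path depends on what you stored at index time.
Note that an exact byte match is stricter than a term match (case,
stemming and whitespace can all differ), so some candidates may end up
with nothing replaced.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Replace every non-overlapping occurrence of `from` in `text` with `to`.
// Returns the number of replacements made.
std::size_t replace_all(std::string& text, const std::string& from,
                        const std::string& to) {
    if (from.empty()) return 0;  // an empty pattern would never advance
    std::size_t count = 0;
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
        ++count;
    }
    return count;
}

// Hypothetical helper: rewrite one candidate file in place.
// Returns true only if the file was read, changed and written back.
bool replace_in_file(const std::string& path, const std::string& from,
                     const std::string& to) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::ostringstream buf;
    buf << in.rdbuf();
    in.close();
    std::string text = buf.str();
    if (replace_all(text, from, to) == 0) return false;  // nothing to do
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << text;
    return static_cast<bool>(out);
}
```

You'd call replace_in_file() only on the files the phrase search returned
for that replacement, rather than on every file.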

> Third, how do I ensure that the phrase words are together (and the one
> with a period between them is not considered a find).

When indexing, pass each chunk of text between periods to
TermGenerator::index_text(), calling increase_termpos() after each
index_text() call.  Then phrases can't span a period.
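
A minimal sketch of that loop, assuming that splitting on every '.'
character is good enough for your text (it will also split things like
"3.14" or "e.g.").  split_on_periods and index_by_sentence are
illustrative names, not Xapian functions; index_by_sentence is a template
so that it can be passed a Xapian::TermGenerator, on which it calls only
index_text() and increase_termpos().

```cpp
#include <string>
#include <vector>

// Split `text` into chunks at each '.', dropping empty chunks.
std::vector<std::string> split_on_periods(const std::string& text) {
    std::vector<std::string> chunks;
    std::string::size_type start = 0;
    while (start <= text.size()) {
        std::string::size_type end = text.find('.', start);
        if (end == std::string::npos) end = text.size();
        if (end > start) chunks.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return chunks;
}

// Index each chunk separately, leaving a gap in term positions after
// each one so that a phrase can't match across a period.
// Generator is expected to be Xapian::TermGenerator (or anything with
// the same index_text() and increase_termpos() methods).
template <class Generator>
void index_by_sentence(Generator& gen, const std::string& text) {
    for (const std::string& chunk : split_on_periods(text)) {
        gen.index_text(chunk);
        gen.increase_termpos();  // default gap is larger than any phrase window you'd normally use
    }
}
```

With a Xapian::TermGenerator whose document has been set via
set_document(), you'd call index_by_sentence(termgen, file_contents) once
per file.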

Cheers,
    Olly


