[Xapian-discuss] [q] phrase replacement in thousands of text files
Olly Betts
olly at survex.com
Mon May 21 06:46:18 BST 2012
On Sun, May 20, 2012 at 10:59:11PM -0400, V S P wrote:
> Obviously a couple of problems: first, 3 million searches over 17GB worth of text
I'm not sure I see why this is a problem unless the run time is
critical.
> Second -- I do not understand how (if at all possible) to get the
> start/end offset of the found phrase within the source file
Xapian doesn't store the byte offsets (only word offsets), so this isn't
possible. It can narrow down the number of files you need to go and
look at for each replacement though, which could make quite a difference
if many of the replacements are rarely done.
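A rough sketch of that narrowing step, using the Python bindings. It assumes each document's data holds the path of the source file and that the phrase is made of plain words that match the indexed terms once lowercased. The names candidate_paths and replace_in_files are made up for illustration:

```python
def candidate_paths(db, phrase):
    """Return the stored paths of documents that contain `phrase`."""
    import xapian  # imported here so replace_in_files() works without Xapian

    # Assumption: lowercased plain words match the unstemmed indexed terms.
    terms = phrase.lower().split()
    query = xapian.Query(xapian.Query.OP_PHRASE, terms)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    # Ask for every match, not just the first few.
    matches = enquire.get_mset(0, db.get_doccount())
    # Assumption: the indexer stored the file path with Document.set_data().
    return [m.document.get_data().decode("utf-8") for m in matches]


def replace_in_files(paths, old, new):
    """Do the actual replacement on the file contents; return changed paths."""
    changed = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        if old in text:
            with open(path, "w", encoding="utf-8") as f:
                f.write(text.replace(old, new))
            changed.append(path)
    return changed
```

Usage would be something like replace_in_files(candidate_paths(db, "old phrase"), "old phrase", "new phrase"). The Xapian match is term-based while the replacement here is an exact string match, so some candidate files may come back unchanged.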
> Third how do I ensure that the phrase words are together (and the one
> with period between them is not considered a find).
When indexing, pass each chunk of text between periods to
TermGenerator::index_text(), calling increase_termpos() after each
index_text() call. Then phrases can't span a period.
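Something along these lines, again with the Python bindings. This is only a sketch: it splits on every "." (so it will also split "3.14" or "e.g."), it assumes db is a WritableDatabase, and the names index_sentences and index_file are made up for illustration:

```python
import re


def index_sentences(termgen, text):
    """Index each period-delimited chunk separately; return the chunk count."""
    count = 0
    for chunk in re.split(r"\.", text):
        chunk = chunk.strip()
        if not chunk:
            continue
        termgen.index_text(chunk)
        # Leave a gap in the term positions so that a phrase search
        # cannot match words on either side of the period.
        termgen.increase_termpos()
        count += 1
    return count


def index_file(db, path):
    """Add one file to the database, storing its path as the document data."""
    import xapian  # imported here so index_sentences() is usable without Xapian

    doc = xapian.Document()
    doc.set_data(path)
    termgen = xapian.TermGenerator()
    termgen.set_document(doc)
    with open(path, encoding="utf-8") as f:
        index_sentences(termgen, f.read())
    db.add_document(doc)
```

Storing the path as the document data is just one choice; it makes it easy to map matches back to files later.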
Cheers,
Olly