[Xapian-discuss] [q] phrase replacement in thousands of text files
V S P
toreason at fastmail.fm
Mon May 21 03:59:11 BST 2012
Hello, first post.
I searched through docs and examples but did not see this particular
problem answered.
I have thousands of text files, about 17GB in total.
Within a file I need to find a phrase (typically up to 3 words together,
separated by spaces, commas, or other non-period punctuation marks).
I have a dictionary of about 3 million phrases and their replacements.
So I need to replace every matching phrase from the dictionary
with its replacement.
The most brute-force approach I thought of was:
a) build an index on all 17GB of documents
b) for every one of the 3 million search phrases, do a search
c) expect each search in (b) to return a Xapian match that gives me the
start and end byte location in a file; remember that location, and the
found phrase, in a 'future replacement list'
d) when done, use the 'future replacement list' to perform the
replacement operation
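
Step (d) can be sketched independently of how the offsets are obtained.
The following is a minimal Python sketch, not Xapian code:
apply_replacements and the (start, end, replacement) tuple layout are
hypothetical names chosen for illustration. It assumes the offsets are
byte offsets into the original file contents and that matches do not
overlap.

```python
def apply_replacements(data, edits):
    """Apply a 'future replacement list' to the bytes of one file.

    data  -- original file contents as bytes
    edits -- iterable of (start, end, replacement) tuples, where
             start/end are byte offsets into the ORIGINAL data
             (end is exclusive) and replacement is bytes.
             This layout is an assumption made for the sketch.
    """
    pieces = []
    pos = 0  # first byte of the original not yet copied to the output
    for start, end, replacement in sorted(edits):
        if start < pos:
            # Two recorded matches cover the same bytes; the sketch
            # refuses to guess which one should win.
            raise ValueError("overlapping edits at offset %d" % start)
        pieces.append(data[pos:start])   # untouched text before the match
        pieces.append(replacement)       # the replacement phrase
        pos = end                        # skip over the matched phrase
    pieces.append(data[pos:])            # untouched tail of the file
    return b"".join(pieces)
```

Building the output in one left-to-right pass means every offset can
refer to the original file, so earlier replacements of a different
length do not shift the later ones.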
Obviously there are a couple of problems. First, that is 3 million
searches over 17GB worth of text.
Second, I do not understand how (if it is possible at all) to get the
start/end offset of the found phrase within the source file.
Third, how do I ensure that the phrase words are together (and that a
pair with a period between them is not considered a match)?
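
To make the second and third points concrete, the matching rule can be
written down as a regular expression over the raw text. This is only a
sketch of the rule, it does not use Xapian, and it says nothing about
doing this efficiently for 3 million phrases. The names SEPARATOR,
phrase_pattern and find_phrase are hypothetical, and the offsets it
returns are character offsets in a Python str, not byte offsets.

```python
import re

# One or more characters that are neither word characters nor a period:
# spaces, commas and other punctuation may separate the words of a
# phrase, but a period may not.
SEPARATOR = r"[^\w.]+"

def phrase_pattern(phrase):
    """Compile a pattern that matches the words of `phrase` in order."""
    words = phrase.split()
    body = SEPARATOR.join(re.escape(word) for word in words)
    # \b keeps the match from starting or ending inside a longer word.
    return re.compile(r"\b" + body + r"\b")

def find_phrase(text, phrase):
    """Return (start, end) character offsets of every match in `text`."""
    return [(m.start(), m.end()) for m in phrase_pattern(phrase).finditer(text)]
```

With this rule, "new, york" and "new york" both count as occurrences
of the phrase "new york", while "new. york" does not.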
thank you in advance for any suggestions,
vsp