[Xapian-discuss] [q] phrase replacement in thousands of text files

V S P toreason at fastmail.fm
Mon May 21 03:59:11 BST 2012


Hello, first post.
I searched through docs and examples but did not see this particular
problem answered.

I have thousands of text files with a total size of about 17GB.
Within a file I need to find a phrase (typically up to 3 words together,
separated by spaces, commas, or other non-period punctuation marks).

I have a dictionary of about 3 million phrases and their replacements.

So I need to replace all of the matching phrases from the dictionary
with their replacements.

The most brute-force approach I thought of was:
a) build an index on all of the 17GB of documents
b) for every one of the 3 million search phrases, do a search
c) expect each search in (b) to return a Xapian match giving the start
and end byte location in a file; remember that location, and the found
phrase, in a 'future replacement list'
d) when done, use the 'future replacement list' to perform the
replacement operation
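
To make step (d) concrete, here is a minimal Python sketch of what I have
in mind. It assumes the 'future replacement list' for one file is a list of
(start, end, replacement) tuples with byte offsets, end exclusive, and no
overlapping entries. The function name and the tuple layout are my own
invention for illustration, not anything Xapian provides.

```python
def apply_replacements(data, replacements):
    """Return a copy of `data` (bytes) with each (start, end, new) applied.

    Offsets are byte positions into the ORIGINAL data, end exclusive.
    Entries must not overlap (assumption of this sketch).
    """
    pieces = []
    pos = 0
    # Walk the replacements in file order and copy the untouched gaps
    # between them, so the original offsets never need adjusting.
    for start, end, new in sorted(replacements):
        if start < pos:
            raise ValueError("overlapping replacement at offset %d" % start)
        pieces.append(data[pos:start])
        pieces.append(new)
        pos = end
    pieces.append(data[pos:])
    return b"".join(pieces)


if __name__ == "__main__":
    text = b"the quick brown fox jumps"
    todo = [(20, 25, b"sleeps"), (4, 15, b"slow red")]
    print(apply_replacements(text, todo))
```

Building the output from pieces, rather than editing the buffer in place,
means a replacement of a different length does not shift the offsets of the
entries that come after it.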


Obviously there are a couple of problems.
First -- that is 3 million searches over 17GB worth of text.
Second -- I do not understand how (if at all possible) to get the
start/end offset of the found phrase within the source file.
Third -- how do I ensure that the phrase words are together (and that a
phrase with a period between its words is not considered a match)?
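
To illustrate the second and third points outside of Xapian, this is a
rough Python sketch of the matching rule I am after, using the standard
`re` module on a single file's text. The helper names are hypothetical, the
match is case-sensitive, and the offsets it returns are character offsets
into a decoded string rather than byte offsets, so it is only a statement
of the requirement, not a solution at 17GB scale.

```python
import re

# Allowed between phrase words: one or more characters that are neither
# word characters nor a period (spaces, commas, semicolons, and so on).
SEPARATOR = r"[^\w.]+"


def phrase_pattern(phrase):
    """Compile a pattern matching the words of `phrase` in order."""
    words = phrase.split()
    body = SEPARATOR.join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b")


def find_phrase(text, phrase):
    """Return (start, end) character offsets of every match in `text`."""
    return [(m.start(), m.end()) for m in phrase_pattern(phrase).finditer(text)]


if __name__ == "__main__":
    sample = "Visit New, York today. New. York is not."
    # Only the comma-separated occurrence should be reported.
    print(find_phrase(sample, "New York"))
```

Here "New, York" counts as a match, while "New. York" does not, because the
separator class excludes the period.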


Thank you in advance for any suggestions,
vsp