[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files

Srijon Biswas srijon.biswas at googlemail.com
Wed May 20 11:42:00 BST 2009


Hi.

I was searching around for some documentation on Omega (a query that I
posted just yesterday) and I came across this ticket.

I _think_ that the implementation may be incorrect here. Please correct me
if I am wrong (I have only read the final patch linked in the ticket and
have not actually tried it out):

dir A:
- file A1 [content C1] [last modified M1]
- file A2 [content C2] [last modified M2]
- file A3 [content C3] [last modified M3]

Index dir A.

Then move A1 -> A2 (replacing the old A2) and create a new A1 with new
content. So we get:

dir A:
- file A1 [content C4] [last modified M4]
- file A2 [content C1] [last modified M1]
- file A3 [content C3] [last modified M3]

Index dir A.

In the above scenario, as per the fix, am I correct in assuming that A2 will
not get reindexed (although it should be, since its content changed from C2
to C1), but A1 will? Please correct me if I am wrong.

Maybe the test for changed content should depend on the md5sum and not on
the date (even though this is more expensive than just checking the last
modified date). Something roughly like this:

- Get the url for the file.
- Read the corresponding md5 value from the db if present.
- Create the md5 for this file (I know this does not work for text files,
at least as per the current code, but it need not be that way - see comment
below).
- If md5 matches, then no need to do anything, else continue as normal.
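
The steps above could be sketched roughly as follows. This is only an
illustration in Python, not Omega's actual code: the names md5_of_file,
needs_reindex and demo are made up, and a plain dict stands in for the md5
values stored in the database. The demo replays the A1/A2/A3 scenario.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 of the raw bytes of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reindex(url, path, stored_md5):
    """Decide whether `path` must be (re)indexed.

    `stored_md5` is a dict mapping url -> md5; it is a hypothetical
    stand-in for the value kept in the database.
    """
    current = md5_of_file(path)
    if stored_md5.get(url) == current:
        return False  # md5 matches: nothing to do
    stored_md5[url] = current  # otherwise continue as normal
    return True


def demo():
    """Replay the scenario: index, move A1 -> A2, create new A1, index."""
    stored = {}
    with tempfile.TemporaryDirectory() as d:
        paths = {n: os.path.join(d, n) for n in ("A1", "A2", "A3")}
        for name, content in (("A1", b"C1"), ("A2", b"C2"), ("A3", b"C3")):
            with open(paths[name], "wb") as f:
                f.write(content)
        # First indexing run: every file is new.
        for name, path in paths.items():
            needs_reindex(name, path, stored)
        # Move A1 -> A2, then create a new A1 with new content.
        os.replace(paths["A1"], paths["A2"])
        with open(paths["A1"], "wb") as f:
            f.write(b"C4")
        # Second indexing run.
        return {n: needs_reindex(n, p, stored) for n, p in paths.items()}


if __name__ == "__main__":
    print(demo())
```

With this check both A1 and A2 are picked up as changed on the second run
and A3 is skipped, regardless of what the last modified dates say.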

Also, right now the md5 is taken of the raw file in all cases except text
files, where it is taken of the "processed" text (i.e. content that has
been changed a bit). It does not seem that taking the md5 of the processed
text is of any use at this point (and where it does become useful, maybe
store two values - one md5 for the raw file and another one for the content
of the file after passing through the mime type handler).
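
A rough sketch of what storing two values could look like, again in Python
and purely illustrative: Digests, compute_digests and collapse_whitespace
are made-up names, and collapse_whitespace is just a toy stand-in for
whatever the real mime type handler does to the content.

```python
import hashlib
from typing import Callable, NamedTuple


class Digests(NamedTuple):
    raw: str        # md5 of the file bytes exactly as they are on disk
    processed: str  # md5 of the text produced by the mime type handler


def compute_digests(raw_bytes: bytes,
                    handler: Callable[[bytes], str]) -> Digests:
    """Compute both md5 values for one file.

    `handler` is a hypothetical stand-in for a mime type handler: it
    takes the raw bytes and returns the extracted text.
    """
    text = handler(raw_bytes)
    return Digests(
        raw=hashlib.md5(raw_bytes).hexdigest(),
        processed=hashlib.md5(text.encode("utf-8")).hexdigest(),
    )


def collapse_whitespace(raw_bytes: bytes) -> str:
    """Toy handler: decode as UTF-8 and collapse runs of whitespace."""
    return " ".join(raw_bytes.decode("utf-8").split())
```

The raw value would be the one used to decide whether a file has changed on
disk; the processed value is kept separately for whatever use it turns out
to have, since two files can differ in their raw bytes and still yield the
same processed text.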

Thanks,
Srijon.

