[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files

Olly Betts olly at survex.com
Wed May 20 12:16:59 BST 2009


On Wed, May 20, 2009 at 11:42:00AM +0100, Srijon Biswas wrote:
> I _think_ that the implementation may be incorrect here... please correct me
> if I am wrong (I have just seen the final patch as linked in the ticket, not
> really tried it out):
> 
> dir A:
> - file A1 [content C1] [last modified M1]
> - file A2 [content C2] [last modified M2]
> - file A3 [content C3] [last modified M3]
> 
> Index dir A.
> 
> Then move A1 -> A2, create a new A1 with new content. So we get:
> 
> dir A:
> - file A1 [content C4] [last modified M4]
> - file A2 [content C1] [last modified M1]
> - file A3 [content C3] [last modified M3]
> 
> Index dir A.
> 
> In the above scenario, as per the fix, am I correct in assuming that A2 will
> not get updated (which it should), but A1 will? Please correct me if I am
> wrong.

Yes, this is true.  Similarly if you restore an older file from backup,
or directly mess with timestamps after updating a file (e.g. with touch
--reference).

But then omindex is aimed at indexing web sites, and webservers will
also suffer from similar issues with "If-Modified-Since:" requests if
you do these things, so it's prudent to avoid doing these things in
web-served document trees anyway.

I bet for most users, the large speed gain outweighs these corner cases,
but they ought to be documented.
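
To make the corner case concrete, here is a minimal Python sketch of
the rename scenario quoted above.  It is not the code from the patch:
is_unchanged() and the numeric timestamps are hypothetical, and it
assumes the check simply skips any file whose mtime is not newer than
the time of the previous indexing run.

```python
import os
import tempfile


def is_unchanged(path, last_indexed):
    """Hypothetical check: skip a file whose mtime is not newer than
    the time of the previous indexing run."""
    return os.stat(path).st_mtime <= last_indexed


def demo():
    with tempfile.TemporaryDirectory() as d:
        a1 = os.path.join(d, "A1")
        a2 = os.path.join(d, "A2")
        with open(a1, "w") as f:
            f.write("C1")
        with open(a2, "w") as f:
            f.write("C2")
        os.utime(a1, (1000, 1000))  # M1
        os.utime(a2, (2000, 2000))  # M2
        last_indexed = 3000         # first "Index dir A" happens here

        # mv A1 A2: a rename keeps the file's mtime, so A2 now has
        # content C1 with the old timestamp M1.
        os.replace(a1, a2)
        with open(a1, "w") as f:
            f.write("C4")
        os.utime(a1, (4000, 4000))  # M4

        # True means "looks unchanged, skip it" on the second run.
        return {"A1": is_unchanged(a1, last_indexed),
                "A2": is_unchanged(a2, last_indexed)}
```

Under that assumption A1 is reindexed, while A2 is skipped even though
its content is now C1 rather than C2.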

> Maybe the test for changed content should depend on the md5sum and not on
> the date (even though this does add more burden than just checking the last
> mod date). Something roughly like this:

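Yes, checksumming every file is quite a lot more work than comparing
timestamps, but it would still save the work of reindexing files whose
content hasn't changed.  A fuller solution to ticket #250 would reduce
the gain here, but there would probably still be some: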

http://trac.xapian.org/ticket/250
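
A rough Python sketch of the checksum-based check, for comparison.
This is an illustration only, not omindex code: file_md5(),
needs_reindex() and the in-memory "stored" dict are hypothetical
stand-ins for wherever the digest would actually be kept.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of the raw bytes of the file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reindex(path, stored):
    """Hypothetical check: reindex when no digest is stored for this
    path, or when the stored digest differs from the current one."""
    return stored.get(path) != file_md5(path)


def demo():
    with tempfile.TemporaryDirectory() as d:
        a1 = os.path.join(d, "A1")
        a2 = os.path.join(d, "A2")
        for path, content in ((a1, b"C1"), (a2, b"C2")):
            with open(path, "wb") as f:
                f.write(content)

        # First indexing run: remember one digest per path.
        stored = {p: file_md5(p) for p in (a1, a2)}

        # mv A1 A2, then create a new A1 with different content.
        os.replace(a1, a2)
        with open(a1, "wb") as f:
            f.write(b"C4")

        return needs_reindex(a1, stored), needs_reindex(a2, stored)
```

Here both A1 and A2 are flagged for reindexing, whatever their
timestamps say.  The cost is that every file has to be read in full
on every run, so this saves the indexing work but not the I/O.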

> Also, right now the md5 is being taken of the raw file in all cases, and
> of the "processed" text only for text files (where the md5 is for content
> that has been changed a bit). It does not seem that taking the md5 of the
> processed text is of any use at this point (and where it does become
> useful, maybe store two values - one md5 for the raw file and another one
> for the content of the file after passing through the mime type handler).

That's a bug - the patch to handle non-UTF-8 text came after the md5
one, and before that this was calculating the md5 sum of the "raw" file.
I'll fix that in a moment.

Cheers,
    Olly


