[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files

Olly Betts olly at survex.com
Thu May 21 03:49:51 BST 2009


On Wed, May 20, 2009 at 08:34:15PM +0100, James Aylett wrote:
> On Wed, May 20, 2009 at 08:16:37PM +0100, Richard Boulton wrote:
> 
> > Checking if the file size has changed as well as the date is another
> > approach - it doesn't cause all changes to be noticed, of course, but
> > it's a lot cheaper than computing the MD5 sum of the file (if you've
> > done a stat(), you've already got the size available).

You already have the size available for the file on disk, but not for
the "old" file.  Currently that would require fetching the document
data (which is quite an I/O overhead when many files haven't changed).

It could be stored as a value I guess, which would trade off some I/O
overhead here for extra disk space usage, and slightly increased I/O
when adding a new file.  Perhaps it would be useful to have the size
available as a value for sorting and range restrictions, but there's
still the overhead of reading it for each file which otherwise seems
unchanged.

I'm not convinced this extra work is worth it for most people,
especially since webservers don't check file sizes for
"If-Modified-Since".

It could be an option perhaps - how to handle already indexed files
could be one of: always reindex, check time+size+md5, check time+size,
check time, never reindex (to allow prioritising getting completely
new documents into the index and/or deleted documents out of the index).
Or something like that.
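
As a rough illustration of such an option, here is a minimal Python sketch
(not Omega's code; Omega itself is written in C++).  The names
ReindexPolicy, StoredInfo and needs_reindex are hypothetical, and it
assumes that any differing attribute triggers a reindex, with the
cheapest check done first and the MD5 sum only computed as a last resort.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ReindexPolicy(Enum):
    """Hypothetical names for the five choices listed above."""
    ALWAYS = "always"
    TIME_SIZE_MD5 = "time+size+md5"
    TIME_SIZE = "time+size"
    TIME = "time"
    NEVER = "never"


@dataclass
class StoredInfo:
    """What the index is assumed to remember about the "old" file."""
    mtime: int
    size: Optional[int] = None
    md5: Optional[str] = None


def needs_reindex(policy: ReindexPolicy, stored: StoredInfo,
                  mtime: int, size: int,
                  md5_of_file: Callable[[], str]) -> bool:
    """Decide whether an already indexed file should be indexed again.

    mtime and size come from stat() on the file on disk.  md5_of_file is
    a zero-argument callable so the expensive checksum is only computed
    if the cheaper checks found no difference.
    """
    if policy is ReindexPolicy.ALWAYS:
        return True
    if policy is ReindexPolicy.NEVER:
        return False
    # Every remaining policy starts with the timestamp check.
    if mtime != stored.mtime:
        return True
    if policy is ReindexPolicy.TIME:
        return False
    if size != stored.size:
        return True
    if policy is ReindexPolicy.TIME_SIZE:
        return False
    # TIME_SIZE_MD5: time and size match, so compare checksums.
    return md5_of_file() != stored.md5
```

With "never", existing files are skipped without even a stat(), which is
what makes it useful for getting new documents in quickly.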

> Can we get the inode cheaply? Where supported, I can't think of
> many practical situations where it would change but the contents not.

You can get the inode number from stat() on platforms and filesystems
where "inode" is a meaningful concept.  But moving a large tree of files
to a new partition (e.g. restoring them from backup after a disk failure)
would then force a full reindex, which is unhelpful.
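
To make that concrete, here is a small Python sketch (the helper names
file_identity and simulate_restore are made up for illustration).  It
copies a file roughly the way a restore from backup might, then compares
what stat() reports.  It assumes a platform where st_dev and st_ino are
meaningful.

```python
import os
import shutil
import tempfile


def file_identity(path):
    """Return (device, inode) for path, as reported by stat()."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def simulate_restore():
    """Copy a file as a restore might, and report what stayed the same."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        restored = os.path.join(tmp, "restored.txt")
        with open(original, "w") as f:
            f.write("unchanged contents\n")
        # copy2 copies the data and also tries to preserve the timestamps.
        shutil.copy2(original, restored)
        with open(original, "rb") as a, open(restored, "rb") as b:
            same_contents = a.read() == b.read()
        return {
            "same_contents": same_contents,
            "same_size": (os.stat(original).st_size
                          == os.stat(restored).st_size),
            "same_identity": (file_identity(original)
                              == file_identity(restored)),
        }
```

The contents and size match, but the restored copy is a different inode,
so an inode-based check would treat every such file as changed.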

Also the inode for the "old" file isn't currently stored (and isn't
useful for other purposes), so there's extra I/O and disk space overhead.

And again, webservers don't check the inode for "If-Modified-Since"...

Cheers,
    Olly


