[Xapian-discuss] omindex doesn't check last_mod

Michael Trinkala mdt at trinkala.com
Tue Aug 8 05:10:48 BST 2006


I recommend storing the last modified time and the document MD5 in the value table.  I use both to
determine whether re-indexing is necessary: I compare the last modified time first and, only if
that differs, the MD5 (some files on our system get touched without having their content
modified).  These values also work nicely for sorting results by date and collapsing duplicate
entries.  This has worked very well on a Flint database with about 2.2 million documents.  For
document lookup during indexing I use a unique key (the MD5 of the full filename) stored in the
term table (prefixed with F, followed by the 16-byte binary MD5).  I will gladly contribute these
changes and others if the team is interested.  I will post a list on xapian-devel so we can figure
out what should and shouldn't be included.
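
To make the check above concrete, here is a minimal sketch in Python.  It is not the actual
omindex patch: the function names are hypothetical, and the stored values are passed in as plain
arguments instead of being read from the Xapian value table.

```python
import hashlib
import os


def filename_term(path):
    """Unique lookup term: b'F' followed by the 16-byte binary MD5 of the full filename."""
    full_name = os.fsencode(os.path.abspath(path))
    return b"F" + hashlib.md5(full_name).digest()


def file_md5(path, chunk_size=1 << 20):
    """Binary MD5 of the file content, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def needs_reindex(path, stored_mtime, stored_md5):
    """Decide whether to re-index.

    stored_mtime and stored_md5 stand in for the values saved by a previous
    indexing run (assumption: None means the document was never indexed).
    """
    if stored_mtime is None or stored_md5 is None:
        return True
    # Cheap check first: an unchanged mtime means nothing to do.
    if int(os.path.getmtime(path)) == stored_mtime:
        return False
    # mtime differs: hash the content to catch files that were only touched.
    return file_md5(path) != stored_md5
```

A touched but unmodified file costs one MD5 pass over its content and no re-index.  The indexer
would still want to refresh the stored time in that case so the next run takes the cheap path.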

As for Excel support, check out xls2csv, and catppt does a nice job with PowerPoint:
http://www.45.free.net/~vitus/software/catdoc/

Trink

> On Mon, Aug 07, 2006 at 06:16:53PM +0100, James Aylett wrote:
>> Because it's a little awkward, and no one has done it yet :-)
>
> James' email covers what's involved pretty well - I'd just add that a third
> approach would be to store the last mod times for documents in a separate
> file - possibly even a flat file, where the 4 (or 8?) bytes holding a
> timestamp are located at offset "<docid> * 4" - this could conceivably cut
> down IO when looking up the modification times, and would not be referenced
> at all at search time, so there would be no risk of slowing that down by
> storing unused information in the database. (Of course, you'd still have to
> look up the document ID in the xapian database given the unique ID.)
>
> Experiment is the best (only?) approach to work out what actually works in
> practice.
>
>> It probably needs an option to override this, in case atime gets
>> mangled for some reason (restore from backup, for instance).
>
> Definitely.
>
> --
> Richard
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
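
For reference, the flat-file layout Richard describes in the quoted text could look roughly like
the sketch below.  The details are assumptions, not anything from the thread: 4-byte unsigned
little-endian timestamps, a stored 0 meaning "never recorded", and hypothetical function names.

```python
import os
import struct

# Assumption: one 4-byte unsigned little-endian timestamp per document ID.
RECORD = struct.Struct("<I")


def write_mtime(path, docid, mtime):
    """Store mtime at byte offset docid * 4, creating the file if needed."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as handle:
        handle.seek(docid * RECORD.size)
        handle.write(RECORD.pack(mtime))


def read_mtime(path, docid):
    """Return the stored mtime for docid, or None if nothing was recorded."""
    with open(path, "rb") as handle:
        handle.seek(docid * RECORD.size)
        raw = handle.read(RECORD.size)
    if len(raw) < RECORD.size:
        return None  # docid lies beyond the end of the file
    value = RECORD.unpack(raw)[0]
    # Gaps left by seeking past the end read back as zero bytes.
    return value or None
```

Each lookup is a single seek and a 4-byte read, and the file is never opened at search time.
A 4-byte unsigned field holds Unix times up to the year 2106; the 8-byte variant would use "<Q".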