[Xapian-discuss] omindex doesn't check last_mod

James Aylett james-xapian at tartarus.org
Tue Aug 8 13:58:18 BST 2006


On Mon, Aug 07, 2006 at 09:10:48PM -0700, Michael Trinkala wrote:

> I recommend storing the last modified time and the document MD5 in
> the value table.  I use both to determine if re-indexing is
> necessary: first I compare the last modified time and then, if
> necessary, the MD5 (some files on our system get touched without
> having their content modified).

That's neat. I'd recommend calculating the MD5 over only the first N
bytes of the file (where N is an appropriate number for your data and
hardware).
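
A minimal sketch of that two-stage check, in Python using only the
standard library. The names partial_md5 and needs_reindex, the 64 KiB
default for N, and passing the stored values in as plain arguments are
all assumptions for illustration; this is not what omindex does, and a
real indexer would read the stored mtime and MD5 back from the
document's values.

```python
import hashlib
import os


def partial_md5(path, max_bytes=65536):
    """Return the hex MD5 of at most the first max_bytes bytes of path."""
    digest = hashlib.md5()
    remaining = max_bytes
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(8192, remaining))
            if not chunk:
                break  # file is shorter than max_bytes
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def needs_reindex(path, stored_mtime, stored_md5, max_bytes=65536):
    """Compare the cheap mtime first; hash only if the mtime differs.

    stored_mtime (whole seconds) and stored_md5 (hex string) are
    hypothetical values saved when the file was last indexed.
    """
    if int(os.path.getmtime(path)) == stored_mtime:
        return False  # not touched since the last index run
    # Touched: re-index only if the (partial) content hash changed too.
    return partial_md5(path, max_bytes) != stored_md5
```

The trade-off is that a change which only affects data beyond the
first N bytes leaves the partial MD5 unchanged, so N needs to suit
the kind of files being indexed.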

> For document lookup during indexing I use a unique key (MD5 of the
> full filename) stored in the term table (prefixed with F followed by
> the 16 byte binary MD5).

You can (and probably should) use Q for that, as it's a
document-unique identifying term. If it's a web-centric app, you're
better off using a URI if at all possible - cool URIs don't change,
whereas file paths do.

Of course, for a file-centric app such as desktop search, MD5 of the
filename is just as good (although you can convert to file: scheme
URIs).
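
Roughly, building such a term could look like the sketch below
(Python, standard library only). The function names are made up for
illustration, the hex digest is used instead of the 16 byte binary
form purely for readability, and file_uri assumes an absolute
POSIX-style path.

```python
import hashlib
from urllib.parse import quote


def unique_term_from_path(path):
    """'Q' followed by the hex MD5 of the full filename."""
    return "Q" + hashlib.md5(path.encode("utf-8")).hexdigest()


def unique_term_from_uri(uri):
    """'Q' followed by the URI itself."""
    return "Q" + uri


def file_uri(path):
    """Turn an absolute POSIX path into a file: URI (assumption: no
    Windows drive letters or UNC paths to deal with)."""
    return "file://" + quote(path)
```

With a term like this, WritableDatabase.replace_document(term, doc)
will update the existing document carrying that term, or add a new
one if there isn't one yet.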

> I will gladly contribute these changes and others if the team is
> interested.  I will get a list up on xapian-devel to figure out what
> should/shouldn't be included.
> 
> As for Excel support, check out xls2csv; catppt does a nice job
> with PowerPoint: http://www.45.free.net/~vitus/software/catdoc/

Cool :)

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


