[Xapian-devel] Proposed changes to omindex

Fri Aug 11 06:52:59 BST 2006

Proposed changes to omindex

Currently Available Items
=========================

1) Have the Q prefix term contain the 16 byte MD5 digest of the full file name.  This term is used
for document lookup during indexing.
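A minimal sketch of how such a lookup term could be built, written in Python for brevity (omindex
itself is C++).  The function name unique_term is hypothetical, and hashing the path as UTF-8 bytes
is an assumption; the proposal only specifies the Q prefix and the 16 byte digest.

```python
import hashlib


def unique_term(path: str) -> bytes:
    """Build a fixed-length lookup term: b"Q" plus the raw 16-byte MD5 of the path."""
    # Assumption: the full file name is hashed as UTF-8 bytes.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return b"Q" + digest
```

One consequence of this design is that the term is always 17 bytes long, however long the path is.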

2) Add the document’s last modified time to the value table (ID 0).  This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0).
a. Currently I store the timestamp as a 10 byte string (UNIX time as a decimal string, left-padded
with zeros), e.g. 0969492426.
b. However, for maximum space savings it could be stored as a 4 byte string in big endian format,
with a get/set utility function to handle the conversion if necessary.
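The two encodings in (a) and (b) can be sketched as follows, in Python for brevity.  The function
names are hypothetical, and treating the 4 byte form as an unsigned 32-bit integer is an
assumption.

```python
import struct


def mtime_to_padded(mtime: int) -> bytes:
    """Option (a): 10-byte, zero-padded decimal UNIX time."""
    return f"{mtime:010d}".encode("ascii")


def mtime_to_packed(mtime: int) -> bytes:
    """Option (b): 4-byte unsigned big-endian UNIX time (the "set" side)."""
    return struct.pack(">I", mtime)


def packed_to_mtime(value: bytes) -> int:
    """Inverse of mtime_to_packed (the "get" side)."""
    return struct.unpack(">I", value)[0]
```

Both forms compare bytewise in the same order as the underlying numbers, which is what makes either
one usable as a sort key.  An unsigned 32-bit value covers timestamps up to the year 2106.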

3) Add the MD5 of the document’s contents to the value table as a 16 byte string (binary
representation of the digest) (ID 1).  This could be used as a secondary check for incremental
indexing (i.e. if the file was touched but not changed, don’t replace it) and also to collapse
duplicates (COLLAPSE=1).
The MD5 source code is from the GNU textutils-2.1 package.
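The timestamp check followed by the digest check might look like this sketch (Python for
brevity).  The function name and its parameters are hypothetical; read_content is a callable so
that the file is only read when the timestamp differs.

```python
import hashlib
from typing import Callable


def needs_reindex(stored_mtime: int, stored_md5: bytes,
                  current_mtime: int, read_content: Callable[[], bytes]) -> bool:
    """Return True only if the file content actually changed."""
    if current_mtime == stored_mtime:
        return False  # primary check: timestamp unchanged, skip the file
    # Secondary check: the file was touched, so compare content digests.
    return hashlib.md5(read_content()).digest() != stored_md5
```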

4) For files that require command line utility processing (e.g. pdftotext) I have added a
--copylocal option.  This allows the file to be digested while it is being copied to the local
drive; the command line utility then processes the local copy, saving multiple reads across the
network.  If we want to expand this, it could be used to build a local cache/backup/repository.
For my use I was thinking of putting the files under source control (svn), but that is another
discussion thread.
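Digesting a file while copying it can be sketched as below (Python for brevity).  The function
name and chunk size are hypothetical; this is not the omindex code, only an illustration of
reading the remote file once.

```python
import hashlib


def copy_and_digest(src_path: str, dst_path: str, chunk_size: int = 64 * 1024) -> bytes:
    """Copy src to dst in chunks and return the raw MD5 of the bytes copied."""
    md5 = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)  # digest the chunk ...
            dst.write(chunk)   # ... and write the same bytes to the local copy
    return md5.digest()
```

The external utility would then be run against dst_path, so the source file is read over the
network only once.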

5) I would also recommend storing the full filename in the document data, e.g.
file=/mnt/vol1/www/sample.html.  I have a purge utility that uses this information to clean out
documents that are no longer found on the file system.  FYI: I am currently migrating to a MySQL
metadata repository that will move information like this out of the search index; it also
preserves metadata on complete index rebuilds and allows users to add additional information that
may not be contained in the actual document.
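A purge check along these lines could be sketched as follows (Python for brevity).  It assumes the
document data is a set of key=value lines, one of which starts with file=; the function names are
hypothetical and the actual purge utility is not shown in this post.

```python
import os
from typing import Optional


def filename_from_data(data: str) -> Optional[str]:
    """Return the value of the first "file=" line in the document data, if any."""
    for line in data.splitlines():
        if line.startswith("file="):
            return line[len("file="):]
    return None


def is_stale(data: str) -> bool:
    """True if the document records a file name that no longer exists on disk."""
    path = filename_from_data(data)
    return path is not None and not os.path.exists(path)
```

A purge pass would iterate over all documents and delete those for which is_stale returns True.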

Future Items
============

6) Stream indexer.  Instead of reading the entire file into memory, process it line by line.  This
should reduce memory use when indexing large files.
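A minimal sketch of line-by-line processing, in Python for brevity.  The function name and the
whitespace tokenisation are assumptions made for illustration; a real indexer would apply its own
term generation rules to each line.

```python
from typing import Iterator, TextIO, Tuple


def stream_terms(stream: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (position, term) pairs while holding only one line in memory."""
    position = 0
    for line in stream:            # file objects are read lazily, line by line
        for word in line.split():  # assumption: split on whitespace
            position += 1
            yield position, word.lower()
```

Note that a file consisting of one very long line would still be read in a single piece, so a
chunk size limit may be needed in practice.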

7) Clean up the FIXMEs in the MIME type handlers, e.g. "// FIXME: run pdfinfo once and parse the
output ourselves".  I would use PCRE to extract the desired text.
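As a rough illustration of parsing the output of a single run, the sketch below extracts
"Key: value" lines with one regular expression.  It uses Python's re module as a stand-in for
PCRE, the names are hypothetical, and the exact layout of the tool's output is an assumption.

```python
import re

# Matches "Key:   value" lines; the key may contain letters and spaces.
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z ]*):[ \t]*(.*)$", re.MULTILINE)


def parse_info(output: str) -> dict:
    """Parse all "Key: value" lines from one captured tool run into a dict."""
    return {key: value.strip() for key, value in FIELD_RE.findall(output)}
```

With the output captured once, fields such as the title and author can be looked up in the
returned dict instead of running the utility once per field.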

8) Change the way stemmed terms are added to the database.  Remove the R prefix from raw terms and
only write stemmed terms to the DB if they differ from the original term, prefixing them with a
new prefix (Z?).  If stemming were set to none, this would reduce the current term tables
(termlist, postlist, and position) by about 50%.  The query parser would have to be modified to
use the same rules.
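The proposed rule can be sketched as below (Python for brevity).  toy_stem is a hypothetical
stand-in for a real stemmer, and the Z prefix is the tentative choice from the proposal.

```python
from typing import Callable, List


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms_for(word: str, stem: Callable[[str], str] = toy_stem) -> List[str]:
    """Return the raw term, plus a Z-prefixed stem only when it differs."""
    terms = [word]               # raw term, written without the R prefix
    stemmed = stem(word)
    if stemmed != word:          # skip the stem when it adds nothing new
        terms.append("Z" + stemmed)
    return terms
```

With stemming disabled (a stem function that returns its input unchanged), each word produces
exactly one term.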

Let me know if you are interested in including any of these changes in Xapian.

Thanks,
Trink