[Xapian-devel] Proposed changes to omindex

Fri Aug 11 07:45:02 BST 2006

Michael Trinkala wrote:
> Proposed changes to omindex
> 
> Currently Available Items
> =========================
> 
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
> indexing.
> 
> 2) Add the document’s last modified time to the value table (ID 0).  This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10-byte string (left zero-padded UNIX time string), e.g.
> 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte string in big endian format
> with a get/set utility function to handle the conversion if necessary.
> 
> 3) Add the document’s MD5 to the value table as a 16 byte string (binary representation of the
> digest) (ID 1).  This could be used as a secondary check for incremental indexing (i.e. if the
> file was touched but not changed don’t replace it) and also to collapse duplicates (COLLAPSE=1). 
> The md5 source code is from the GNU testutils-2.1 package.
> 
> 4) For files that require command line utility processing (e.g. pdftotext) I have added a
> --copylocal option.  This allows the file to be digested while being copied to the local drive;
> the command line utility then processes the local file, saving multiple reads across the network.
> If we want to expand this it could be used to build a local cache/backup/repository.  For my use I
> was thinking of putting the files under source control (svn) but that is another discussion
> thread.
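
To make items 2 to 4 concrete, here is a minimal sketch in Python using only
the standard library. It is not omindex code: the function names and the chunk
size are assumptions made for illustration. It shows the two timestamp
encodings from item 2, and computing the MD5 while the file is copied
(item 4), so the source is read only once.

```python
import hashlib
import struct


def encode_mtime_padded(mtime: int) -> bytes:
    """Item 2a: 10-byte, left zero-padded decimal UNIX time."""
    return b"%010d" % mtime


def encode_mtime_packed(mtime: int) -> bytes:
    """Item 2b: 4-byte unsigned big-endian UNIX time."""
    return struct.pack(">I", mtime)


def decode_mtime_packed(value: bytes) -> int:
    """Inverse of encode_mtime_packed()."""
    return struct.unpack(">I", value)[0]


def copy_and_digest(src: str, dst: str, chunk_size: int = 65536) -> bytes:
    """Copy src to dst and return the 16-byte binary MD5 of the content.

    Hypothetical helper: each chunk is hashed and written in the same
    pass, so the (possibly remote) source file is read exactly once.
    """
    md5 = hashlib.md5()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            md5.update(chunk)
            fout.write(chunk)
    return md5.digest()
```

Both encodings compare bytewise in the same order as the numeric timestamps
(for non-negative values that fit the width), which is what sorting on the
value relies on.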

I already have a cache_dir option in my omega.conf and use it
successfully in omindex for recursive local zip/rar/msg/pst "virtual
directories", with last_mod checking. MSVC is not supported, sorry.
I'll clean it up and post it here.
Your idea of caching the output of costly extractors like xls2csv and
pdftotext also seems promising, but with the last_mod check already
implemented it is not really needed IMHO.
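
For reference, the check being discussed could look roughly like the
following Python sketch. This is neither my patch nor Michael's code; the
function and parameter names are assumptions. It compares the stored last
modified time first and falls back to the MD5 only when the file was touched.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the 16-byte binary MD5 of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.digest()


def needs_reindex(path, stored_mtime, stored_md5):
    """Decide whether the document for `path` should be replaced.

    stored_mtime and stored_md5 stand for what the index recorded on the
    previous run (value slots 0 and 1 in the proposal); pass None for
    stored_mtime when the file has never been indexed.
    """
    if stored_mtime is None:
        return True                      # new file: always index
    if int(os.stat(path).st_mtime) == stored_mtime:
        return False                     # untouched: cheap early exit
    # Touched: only reindex if the content really changed.
    return file_md5(path) != stored_md5
```

The digest is only computed for files whose timestamp differs, so an
unchanged tree costs one stat() per file.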

> 5) I would also recommend storing the full filename in the document data.
> file=/mnt/vol1/www/sample.html.  I have a purge utility that cleans out documents that are no
> longer found on the file system using this information.  FYI: I am currently migrating to a MySQL
> metadata repository that will move information like this out of the search index; it also
> preserves metadata on complete index rebuilds and allows users to add additional information that
> may not be contained in the actual document.
> 
> Future Items
> ============
> 6) Stream indexer.  Instead of reading the entire file into memory, process it line by line.  This
> should make indexing large files more efficient.
> 
> 7) Clean up the FIXMEs in mime type handlers, e.g. // FIXME: run pdfinfo once and parse the output
> ourselves.  I would use PCRE to extract the desired text.
> 
> 8) Change the way stemmed terms are added to the database.  Remove the R prefix from raw terms and
> only write stemmed terms to the DB if they differ from the original term, prefixing them with Z?. 
> If stemming was set to none this would reduce the current term tables (termlist, postlist, and
> position) by about 50%. The query parser would have to be modified to use the same rules.
> 
> Let me know if you are interested in including any of these changes in Xapian.
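
The term generation rule in item 8 can be sketched as follows (Python, for
illustration only). toy_stem is a deliberately crude stand-in for a real
stemmer, and the "Z" prefix is the tentative one from the proposal; both
are assumptions, not current omindex behaviour.

```python
def toy_stem(word):
    """Hypothetical stemmer: strips a trailing 'ing' or 's' from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms_for(words, stem=toy_stem):
    """Return the terms to index for a sequence of words.

    Raw terms are written without the R prefix; a stemmed term is added
    (with a Z prefix) only when it differs from the raw term. With
    stem=None, i.e. stemming set to none, only raw terms are produced.
    """
    terms = []
    for word in words:
        word = word.lower()
        terms.append(word)
        if stem is not None:
            stemmed = stem(word)
            if stemmed != word:
                terms.append("Z" + stemmed)
    return terms
```

For example, terms_for(["Indexing", "files", "fast"]) yields
["indexing", "Zindex", "files", "Zfile", "fast"]: "fast" produces a single
term because its stem is identical to the raw form.
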
-- 
Reini Urban
http://phpwiki.org/  http://murbreak.at/
http://helsinki.at/  http://spacemovie.mur.at/