[Xapian-devel] Proposed changes to omindex

Sat Aug 26 22:56:47 BST 2006

On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:
> Proposed changes to omindex

One suggestion before I go into details - even if some of these patches
may not be things we'd want to include in the mainstream releases right
now, they may still be of interest to some other users.  So I'd
encourage you to offer them for download, or just post them here if they
aren't too big.  The same goes for other people with patches they're
happy to share.

> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

There are two issues here really.

The first is if the unique id should be based on the file path or the
URL.  Currently omindex uses the URL, but the file path could be used
instead.  The main difference I can see is that it would allow the URL
mappings to be changed without a reindex (providing the omega CGI
applied the mappings at search time) but I'm not sure how useful that
really is - I can't remember the last time I reconfigured the url to
file mappings on any webserver I maintain.

On the flip side, currently you can move the physical locations of
files around and change the URL mappings in the web server so the URLs
remain the same, and omindex won't have to reindex a thing.  That
actually seems a more likely scenario to me (though again I can't
remember the last time I've actually done this).

As James has said, we've discussed this before and ended up staying with
how things are, mostly because there didn't seem to be any particular
advantage to changing.

The other issue is that terms have an upper length limit, so you need a
way to cope with overly long URLs/file paths when build UID terms.
Currently we only hash if the URL is over about 240 characters.  The
problem with always using only a hash is that you can get collisions
even with modest length URLs/paths.

While this might seem a bit of a theoretical risk, there are "only"
256^16 MD5 sums, but more than 255^240 file names which would easily fit
in a term - that's more than 10^539 file names per MD5 sum, so really a
very large number of possible collisions!  Even if you only consider
filenames including alphanumerics, "_", "-", and "/", it's still more
than 10^395 file names per MD5 sum.

> 2) Add the document’s last modified time to the value table (ID 0).
> This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0)
> a. Currently I store the timestamp as a 10 byte string (left zero
> padded UNIX time string) i.e. 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte
> string in big endian format with a get/set utility function to handle
> the conversion if necessary.

I think this would be very useful.  I tend to think storing the number
in 4 bytes (or perhaps 5 to take us past 2038...) is worth the effort
since you have to convert the number when storing and retrieving as a
string anyway.  The functions needed are available already (on Unix
at least) as htonl and ntohl.

> 3) Add the document’s MD5 to the value table as a 16 byte string
> (binary representation of the digest) (ID 1).  This could be used as a
> secondary check for incremental indexing (i.e. if the file was touched
> but not changed don’t replace it) and also to collapse duplicates
> (COLLAPSE=1). 
> The md5 source code is from the GNU testutils-2.1 package.

I think this would be useful too.

It'd be marginally better to use a non-GPL md5 implementation (we're
trying to eliminate unrelicensable GPL code from the core library, but
it'd be nice to be able to relicense Omega too).  A quick Google
reveals at least a couple of candidates, though I've not looked at
either in any detail:

http://sourceforge.net/projects/libmd5-rfc/ zlib/libpng License (BSD-ish)
http://www.fourmilab.ch/md5/ public domain

But unless the md5 api is complex, I imagine it'd be easy enough to drop
one of these in instead at a later date.  The GNU version should be very
well tested at least, whereas the above implementations may be less so.

> 4) For files that require command line utility processing (i.e.
> pdftotext) I have added a --copylocal option.  This allows the
> file to be digested while being copied to the local drive and then the
> command line utility processes the local file saving multiple reads
> across the network.

Have you actually benchmarked this?

A decent OS should cache the file's contents and avoid the multiple
reads across the network, so this could end up being slower than just
reading the remote file twice (because the file needs to be written and
flushed to local disk before the filter program gets run).

If it really does help, it seems a useful addition.

> If we want to expand this it could be used to
> build a local cache/backup/repository.  For my use I was thinking of
> putting the files under source control (svn) but that is another
> discussion thread.

I think backup and source control are really outside of the scope of
omindex, unless I misunderstand what you're suggesting here.

> 5) I would also recommend storing the full filename in the document
> data.  file=/mnt/vol1/www/sample.html.  I have a purge utility that
> cleans out documents that are no longer found on the file system using
> this information.

As James says, we have an different approach to purging removed files
during indexing which doesn't require this field.  I don't object
strongly to adding this if it's actually useful though.

> FYI: I am currently migrating to a MySQL metadata repository that will
> move information like this out of the search index; it also preserves
> metadata on complete index rebuilds and allows users to add additional
> information that may not be contained in the actual document.

There's certainly something to be said for keeping information useful
for (re)indexing but not for search in a separate place.  The downside
is that it's hard to flush the Xapian index and metadata store
atomically so you need a robust strategy to cope with indexing being
interrupted when the two aren't in sync.

> Future Items
> ============
> 6) Stream indexer.  Instead of reading the entire file into memory,
> process it line by line.  This should make indexing large files
> more efficient.

Line-by-line isn't much better - it's not unusual to find long HTML
documents which are all on one line (e.g. those produced on an old Mac
where the end of line character is different, or those generated by
a script).

But some sort of chunked reading isn't a bad idea.  The HTML parser
currently relies on indefinite lookahead which may be awkward to do
while dealing with chunks, but that can probably be fixed without
changing how HTML documents parse in cases which actually matter.

> 7) Clean up the fixme’s in mime type handlers i.e. // FIXME: run
> pdfinfo once and parse the output ourselves.  I woudl use pcre to
> extract the desired text.

Even PCRE is really overkill as you're looking for a constant string
in every case.  It's just sheer laziness that I didn't do it right to
start with.  Sorry.

> 8) Change the way stemmed terms are added to the database.  Remove the
> R prefix from raw terms and only write stemmed terms to the DB if they
> differ from the original term, prefixing them with Z?.

"Z?" doesn't match our existing conventions for prefixes, but the choice
of prefix is just cosmetic.

This would mean that a search for "words which stem to 'foo'" would
become foo OR Z?foo, which will be slower and give less accurate
statistics (though they'll probably be some speed gain from reduced
VM file cache pressure in many cases).

So are you suggesting we should generate the non-stemmed terms from
every word?  Currently R terms are only generated for capitalised
words, which is really done to allow searches for a proper nouns
without problems caused by stemming.  However, this feature is
sometimes problematic itself - people type in capitalised words
in queries without knowing about the feature and sometimes the
results returned aren't great.

> If stemming
> was set to none this would reduce the current term tables (termlist,
> postlist, and position) by about 50%. The query parser would
> have to be modified to use the same rules.

Is it really as much as 50% for all of them?  We only generate R terms
for capitalised words, so this suprises me.

I've actually been thinking of reworking how we handle indexing of
stemmed and unstemmed forms myself.  No firm conclusions, but I've
been wondering about indexing all words unstemmed with positional
information and all stemmed forms without.  This would mean that
we could still support phrase searching as we currently implement it,
and NEAR for unstemmed words.  A capitalised word in a query could
search for an unstemmed form, and a non-capitalised word for a
stemmed form.  Also stemming could be turned off at query time.
This would save slightly more space in the position table to your
approach, but not as much in the termlist or postlist table.

Perhaps some combination of our ideas would work.  I think I need to
mull it over more.

Cheers,
    Olly