[Xapian-devel] Proposed changes to omindex

Sun Aug 27 19:00:44 BST 2006

On Sun, Aug 27, 2006 at 04:05:22PM +0100, James Aylett wrote:
> On Sat, Aug 26, 2006 at 10:56:47PM +0100, Olly Betts wrote:
> 
> > One suggestion before I go into details - even if some of these patches
> > may not be things we'd want to include in the mainstream releases right
> > now, they may still be of interest to some other users.  So I'd
> > encourage you to offer them for download, or just post them here if they
> > aren't too big.  The same goes for other people with patches they're
> > happy to share.
> 
> Michael and I discussed briefly having a bit more detailed "outreach"
> links on the xapian website. The only reason we don't have more at the
> moment is that we haven't really started tracking all the extensions
> and uses that people have done (and it's only very recently that
> xapian use has really started snowballing, if you'll excuse the
> unintentional pun).

I pick up new uses of Xapian and note them in a file which I go through
periodically and add to users.php, but it's a surprisingly time-consuming
job.

> I'm thinking of a kind of directory, with links categorised: "useful
> patches", and "useful libraries", "useful helper programs" (like
> filters for indexing), "systems that integrate xapian", howtos,
> whatever. If this seems a good idea, I'm happy to be the contact for
> submissions and updates for this. (Does xapian.org auto-update like
> snowball does?)

It's supposed to be automatic, but actually you have to press this
button...

In reality, it's a pain for others to update as things currently are,
mostly because the search index is updated by the script and that's
owned by my userid.  Nobody else has shown any desire to update pages,
so I've not worried about it so far.  You can just copy new files
directly to the web tree (I often do for trivial changes), but make
sure the permissions are sane if you do (and check the changes in too
or they'll get lost!)

It might be better to put this directory on the wiki anyway - it's the
sort of thing we created the wiki for, and it would allow people to just
add their own entries.  Then your job would just be to make sure that
things stay tidy and sort out links which go dead.

> > The first is if the unique id should be based on the file path or the
> > URL.  Currently omindex uses the URL, but the file path could be used
> > instead.  [...]
> 
> But: Cool URIs Don't Change. So you might radically rearrange the way
> you serve your website (moving from static serve to rendering-driven
> XML, or to a CMS), but it would be nice if you didn't have to reindex
> the whole lot.

FWIW, this argues for the status quo.

> > > 2) Add the document’s last modified time to the value table (ID 0).
> > 
> > I think this would be very useful.  I tend to think storing the number
> > in 4 bytes (or perhaps 5 to take us past 2038...) is worth the effort
> > since you have to convert the number when storing and retrieving as a
> > string anyway.  The functions needed are available already (on Unix
> > at least) as htonl and ntohl.
> 
> htonl / ntohl won't work with 5 bytes, and indeed I'd recommend we
> either use 4 bytes or 8.

No, but it's easy to handle the extra byte yourself (and it can just be
a zero right now anyway).
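
To illustrate handling the extra byte by hand, here is a minimal sketch
in Python rather than omindex's own C++.  The helper names
pack_mtime_5 and unpack_mtime_5 are hypothetical and not part of
omindex; the layout (one high byte followed by the low 32 bits in
network byte order) is an assumption about how the 5 byte form might
look.

```python
import struct

def pack_mtime_5(mtime: int) -> bytes:
    """Encode a Unix timestamp as 5 big-endian bytes (hypothetical helper)."""
    if not 0 <= mtime < 1 << 40:
        raise ValueError("timestamp does not fit in 5 bytes")
    # The high byte is zero for any present-day timestamp; the low 32 bits
    # are written in network byte order, the same layout htonl produces.
    return struct.pack(">BI", mtime >> 32, mtime & 0xFFFFFFFF)

def unpack_mtime_5(value: bytes) -> int:
    """Decode the 5 byte form produced by pack_mtime_5."""
    high, low = struct.unpack(">BI", value)
    return (high << 32) | low
```

Because the encoding is fixed-length and big-endian, comparing the
encoded strings byte by byte gives the same order as comparing the
timestamps numerically.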

> We *could* start with 4 bytes and then auto-upgrade. Not sure if the
> space saving over 8 bytes is actually worth the hassle of maintaining
> BC code after 2038 though.

The auto-upgrade would be rather painful for a large database (though to
be honest I'd be astonished if we don't have an incompatible database
format change in the next 32 years anyway), which is why I suggested we
might want to put the extra byte in ahead of time.

8 really is overkill - we are considering dates on files here, so it's
only dates which have happened which are relevant, and 5 bytes takes you
to roughly the year 36800!

Anyway, I think it's sanest just to go with 4 bytes for now.

> > It'd be marginally better to use a non-GPL md5 implementation (we're
> > trying to eliminate unrelicensable GPL code from the core library, but
> > it'd be nice to be able to relicense Omega too).
> >
> > But unless the md5 api is complex, I imagine it'd be easy enough to drop
> > one of these in instead at a later date.  The GNU version should be very
> > well tested at least, whereas the above implementations may be less so.
> 
> Is md5 the right hash for us? I suspect it is, because we don't
> actually need strong cryptographic hash properties, but it's worth
> thinking about.

I had already considered this - the only concern I can see is that
somebody malicious might create a document with an identical MD5
checksum to one that they don't want you to find.  This seems a very
artificial situation though.

The problem is that any cryptographic hash gets less secure as computing
power increases (and as researchers discover shortcuts to attack it with
less complexity than brute-force requires).  But we don't want to pick
something which will stay fantastically secure for decades to come but
is currently insanely computationally intensive, as we need to run it on
every file we index or consider for reindexing.

A quick test suggests that SHA-1 is a bit more than 50% slower than MD5,
and SHA-1 isn't looking particularly future-proof.

So I think MD5 is probably an appropriate choice currently.
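
As a rough illustration of the per-file cost being discussed, here is a
minimal Python sketch that checksums a file in chunks.  The names
file_md5 and needs_reindex are hypothetical, and the idea that the
checksum is stored alongside the document and compared on the next run
is an assumption for the example, not a description of the patch.

```python
import hashlib

def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Reading in fixed-size chunks keeps memory use flat for big files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_reindex(path: str, stored_md5: str) -> bool:
    """True if the file's content no longer matches the stored checksum."""
    return file_md5(path) != stored_md5
```

Swapping hashlib.md5 for hashlib.sha1 in file_md5 is enough to repeat
the speed comparison on your own files.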

> > > 4) For files that require command line utility processing (i.e.
> > > pdftotext) I have added a --copylocal option.
> > 
> > If it really does help, it seems a useful addition.
> 
> I'd like an option to turn it off, if we do include it. I'm not 100%
> certain why I think this, though.

For the case where you're indexing from local disk already!  I suspect
this is much more common than indexing from a network drive.
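
For what it's worth, the behaviour being discussed might look something
like the Python sketch below.  This is only an assumption about what
--copylocal does (copy the file to local temporary storage, run the
filter on the copy, then clean up); with_local_copy and its copy_local
switch are hypothetical names, and run_filter stands in for whatever
runs the external utility.

```python
import os
import shutil
import tempfile

def with_local_copy(source_path, run_filter, copy_local=True):
    """Run run_filter(path) on source_path, optionally via a local copy."""
    if not copy_local:
        # Already on local disk: hand the original path to the filter.
        return run_filter(source_path)
    tmp_dir = tempfile.mkdtemp(prefix="omindex-")
    try:
        local_path = os.path.join(tmp_dir, os.path.basename(source_path))
        shutil.copyfile(source_path, local_path)
        return run_filter(local_path)
    finally:
        # Remove the temporary copy whether or not the filter succeeded.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Here run_filter could be a function which runs pdftotext on the path it
is given and returns the extracted text.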

> [filename in data field]
> > As James says, we have a different approach to purging removed files
> > during indexing which doesn't require this field.  I don't object
> > strongly to adding this if it's actually useful though.
> 
> I think it has definite advantages. More generally, it's a source
> identifier, which could be:
> 
>  * filename of source file
>  * SQL database table primary key
>  * Object database lookup key
>  * URI of resource with metadata in RDF database

But omindex isn't this general - it indexes files forming a website.
There's nothing to stop people who are indexing from other sources
(whether with scriptindex or a custom indexer) adding a source
identifier if they find it useful, but let's consider whether it's
generally useful for omindex to do it rather than looking at other
situations.

> It would be nice to *either* have a separate source type field, *or*
> just agree that if you need it, you should probably always stuff
> fully-qualified URIs in the field, so you can create your own
> private-use URI schemes as needed.

If you're taking the "every URI is sacred" view, the source identifier
can change while the URI doesn't and the document doesn't get reindexed.
So it could be stale information anyway.

> > However, this feature is sometimes problematic itself - people type
> > in capitalised words in queries without knowing about the feature
> > and sometimes the results returned aren't great.
> 
> I think the problem here is more to do with that. Could we have an
> option to lowercase the query string beforehand, just a CGI param you
> can punt into omega?

You can already achieve the same end result rather less crudely by
using $set{stem_all,true} in the query template.  If you want it
conditional on a CGI parameter, just use:

  $if{$eq{$cgi{STEMALL},yes},$set{stem_all,true}}

Cheers,
    Olly