[Xapian-devel] Proposed changes to omindex

Sun Aug 27 16:05:22 BST 2006

On Sat, Aug 26, 2006 at 10:56:47PM +0100, Olly Betts wrote:

> > Proposed changes to omindex
> 
> One suggestion before I go into details - even if some of these patches
> may not be things we'd want to include in the mainstream releases right
> now, they may still be of interest to some other users.  So I'd
> encourage you to offer them for download, or just post them here if they
> aren't too big.  The same goes for other people with patches they're
> happy to share.

Michael and I briefly discussed having more detailed "outreach"
links on the xapian website. The only reason we don't have more at the
moment is that we haven't really started tracking all the extensions
and uses that people have done (and it's only very recently that
xapian use has really started snowballing, if you'll excuse the
unintentional pun).

I'm thinking of a kind of directory, with links categorised: "useful
patches", "useful libraries", "useful helper programs" (like
filters for indexing), "systems that integrate xapian", howtos,
whatever. If this seems a good idea, I'm happy to be the contact for
submissions and updates for this. (Does xapian.org auto-update like
snowball does?)

> The first is if the unique id should be based on the file path or the
> URL.  Currently omindex uses the URL, but the file path could be used
> instead.  The main difference I can see is that it would allow the URL
> mappings to be changed without a reindex (providing the omega CGI
> applied the mappings at search time) but I'm not sure how useful that
> really is - I can't remember the last time I reconfigured the url to
> file mappings on any webserver I maintain.

But: Cool URIs Don't Change. So you might radically rearrange the way
you serve your website (moving from static serving to rendering-driven
XML, or to a CMS), but it would be nice if you didn't have to reindex
the whole lot.
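
To make the "apply the mappings at search time" idea concrete, here is a
minimal sketch, assuming the index stores file paths and the CGI holds a
table of filesystem-prefix to URL-prefix mappings. The names UrlMappings
and path_to_url are hypothetical and not part of omega or omindex.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical mapping table: filesystem prefix -> URL prefix.
using UrlMappings = std::vector<std::pair<std::string, std::string>>;

// Return the URL for `path` using the longest matching filesystem prefix,
// or an empty string if no mapping applies.
std::string path_to_url(const UrlMappings& mappings, const std::string& path) {
    const std::pair<std::string, std::string>* best = nullptr;
    for (const auto& m : mappings) {
        // compare() is zero only when `path` starts with the prefix m.first.
        if (path.compare(0, m.first.size(), m.first) == 0 &&
            (!best || m.first.size() > best->first.size())) {
            best = &m;
        }
    }
    if (!best) return std::string();
    return best->second + path.substr(best->first.size());
}
```

With something like this, changing how the site is served only means
editing the table; the stored paths (and so the index) stay as they are.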

I'm aware that I'm unusual in insisting on this sort of thing; I have
to wage small wars at work to get people to believe me. On my side of
the argument are some fairly hefty WebArch names, though :-)

> > 2) Add the document’s last modified time to the value table (ID 0).
> 
> I think this would be very useful.  I tend to think storing the number
> in 4 bytes (or perhaps 5 to take us past 2038...) is worth the effort
> since you have to convert the number when storing and retrieving as a
> string anyway.  The functions needed are available already (on Unix
> at least) as htonl and ntohl.

htonl / ntohl won't work with 5 bytes, and indeed I'd recommend we
either use 4 bytes or 8. (htonll / ntohll exist on Solaris, and there
should be equivalents lying around somewhere on other 64-bit
platforms.)

We *could* start with 4 bytes and then auto-upgrade. I'm not sure the
space saving over 8 bytes is actually worth the hassle of maintaining
backwards-compatibility code after 2038, though.
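
As a rough sketch of the 8-byte option, the conversion can be done with
plain shifts, which avoids depending on htonll / ntohll being available.
The function names encode_mtime and decode_mtime are made up for
illustration; this is not existing omindex code.

```cpp
#include <cstdint>
#include <string>

// Encode a timestamp as 8 big-endian bytes, so that byte-wise string
// comparison orders values the same way as numeric comparison.
std::string encode_mtime(std::uint64_t t) {
    std::string out(8, '\0');
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<char>(t & 0xff);
        t >>= 8;
    }
    return out;
}

// Decode the 8-byte form; returns false if the string is the wrong length.
bool decode_mtime(const std::string& s, std::uint64_t& t) {
    if (s.size() != 8) return false;
    t = 0;
    for (unsigned char c : s) t = (t << 8) | c;
    return true;
}
```

Because the most significant byte comes first, values past the 32-bit
limit still sort after earlier ones without any special casing.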

> It'd be marginally better to use a non-GPL md5 implementation (we're
> trying to eliminate unrelicensable GPL code from the core library, but
> it'd be nice to be able to relicense Omega too).
>
> But unless the md5 api is complex, I imagine it'd be easy enough to drop
> one of these in instead at a later date.  The GNU version should be very
> well tested at least, whereas the above implementations may be less so.

Is md5 the right hash for us? I suspect it is, because we don't
actually need strong cryptographic hash properties, but it's worth
thinking about.

> > 4) For files that require command line utility processing (i.e.
> > pdftotext) I have added a --copylocal option.
> 
> If it really does help, it seems a useful addition.

I'd like an option to turn it off, if we do include it. I'm not 100%
certain why I think this, though.

[filename in data field]
> As James says, we have a different approach to purging removed files
> during indexing which doesn't require this field.  I don't object
> strongly to adding this if it's actually useful though.

I think it has definite advantages. More generally, it's a source
identifier, which could be:

 * filename of source file
 * SQL database table primary key
 * Object database lookup key
 * URI of resource with metadata in RDF database

It would be nice to *either* have a separate source type field, *or*
just agree that if you need it, you should probably always stuff
fully-qualified URIs in the field, so you can create your own
private-use URI schemes as needed.
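
A minimal sketch of the "always a fully-qualified URI" convention,
assuming the identifier is simply "<scheme>:<rest>". The helper names and
the "x-sql" scheme below are invented for illustration, and no
percent-encoding of the key is attempted.

```cpp
#include <string>

// Build a source identifier of the form "<scheme>:<key>".
// Scheme names passed in by callers (e.g. "x-sql") are hypothetical.
std::string make_source_id(const std::string& scheme, const std::string& key) {
    return scheme + ":" + key;
}

// Split a source identifier into scheme and remainder at the first ':'.
// Returns false if there is no non-empty scheme.
bool split_source_id(const std::string& id, std::string& scheme,
                     std::string& rest) {
    std::string::size_type colon = id.find(':');
    if (colon == std::string::npos || colon == 0) return false;
    scheme = id.substr(0, colon);
    rest = id.substr(colon + 1);
    return true;
}
```

The scheme then doubles as the "source type" field, so a reindexer can
dispatch on it without a second value being stored.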

> > FYI: I am currently migrating to a MySQL metadata repository that will
> > move information like this out of the search index; it also preserves
> > metadata on complete index rebuilds and allows users to add additional
> > information that may not be contained in the actual document.
> 
> There's certainly something to be said for keeping information useful
> for (re)indexing but not for search in a separate place.  The downside
> is that it's hard to flush the Xapian index and metadata store
> atomically so you need a robust strategy to cope with indexing being
> interrupted when the two aren't in sync.

If you have a really sophisticated setup and really, really need this
kind of thing, with some work to tidy things up on rollback you can
use a distributed transaction mechanism such as JTA.

My feeling is that omega out of the box should just neatly work out of
the Xapian db (with some sort of config file that describes your
setup), and then if you want to do something much more interesting we
should provide a bit of guidance on how to approach it. In complex
systems, having multiple EISes (enterprise information systems) is
almost always going to be the right practical approach, at least at
the moment.

> So are you suggesting we should generate the non-stemmed terms from
> every word?  Currently R terms are only generated for capitalised
> words, which is really done to allow searches for proper nouns
> without problems caused by stemming.  However, this feature is
> sometimes problematic itself - people type in capitalised words
> in queries without knowing about the feature and sometimes the
> results returned aren't great.

I think the problem here has more to do with that last point, people
capitalising query words without knowing about the feature. Could we
have an option to lowercase the query string beforehand, just a CGI
param you can punt into omega?
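
Something along these lines is what the option would do to the query
string before parsing. This is only a sketch: lowercase_query is a
hypothetical helper, and it assumes ASCII-only folding is acceptable,
leaving any non-ASCII bytes untouched.

```cpp
#include <string>

// Lowercase ASCII letters only; all other bytes (including multi-byte
// UTF-8 sequences) are passed through unchanged.
std::string lowercase_query(std::string query) {
    for (char& c : query) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return query;
}
```

One thing to watch: folding the whole string also lowercases any
uppercase operator keywords the query syntax relies on, so a real
version would probably need to leave those alone.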

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org