[Xapian-devel] Proposed changes to omindex

James Aylett james-xapian at tartarus.org
Sun Aug 27 19:47:10 BST 2006


On Sun, Aug 27, 2006 at 07:00:44PM +0100, Olly Betts wrote:

> It might be better to put this directory on the wiki anyway - it's the
> sort of thing we created the wiki for, and it would allow people to just
> add their own entries.  Then your job would just be to make sure that
> things stay tidy and sort out links which go dead.

I have a thing against projects that insist on using a wiki for
permanent documentation. It just never feels very professional, and is
often difficult to keep neat. Of course if I'm triaging it that's less
of an issue.

In my view the "right" solution *will* be to use the wiki as a highly
mutable playpen for the documentation, links etc., and then use a CMS
to manage the website. (Still not sure where this leaves the core
documentation that needs to be exported as a book, although some CMSs
have suitable functionality.)

The reason I say "will" is that we'd want Xapian integrated into the
CMS, and I don't think there are yet enough CMSs where that's been
done to give us a reasonable choice.

For now, using the wiki seems sensible. We at least are using one with
decent notification features, which makes my life easier.

> > We *could* start with 4 bytes and then auto-upgrade. Not sure if the
> > space saving over 8 bytes is actually worth the hassle of maintaining
> > BC code after 2038 though.
> 
> The auto-upgrade would be rather painful for a large database (though to
> be honest I'd be astonished if we don't have an incompatible database
> format change in the next 32 years anyway), which is why I suggested we
> might want to put the extra byte in ahead of time.

By auto-upgrade, I *don't* mean upgrading the database, I mean
transparency to the end user, i.e. your readers (omindex, primarily)
can cope with either 4-byte or 8-byte values, and upgrade as they
update documents. So starting with 4-byte is fine.
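
To illustrate the reader-side transparency described above, here is a
minimal sketch in Python. It is not omindex's actual on-disk format: it
assumes timestamps are stored as big-endian signed integers and that
the width can be told from the length of the stored value. The function
names are hypothetical.

```python
import struct


def decode_timestamp(raw: bytes) -> int:
    """Decode a big-endian signed timestamp stored in 4 or 8 bytes."""
    if len(raw) == 4:
        return struct.unpack(">i", raw)[0]
    if len(raw) == 8:
        return struct.unpack(">q", raw)[0]
    raise ValueError(f"unexpected timestamp width: {len(raw)} bytes")


def encode_timestamp(seconds: int) -> bytes:
    """Write 4 bytes while the value fits in 32 bits, otherwise 8 bytes."""
    if -2**31 <= seconds < 2**31:
        return struct.pack(">i", seconds)
    return struct.pack(">q", seconds)
```

A reader built this way never needs a bulk conversion: old 4-byte
values stay readable, and wider values only appear as documents are
rewritten.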

> > Is md5 the right hash for us? I suspect it is, because we don't
> > actually need strong cryptographic hash properties, but it's worth
> > thinking about.
> 
> I had already considered this - the only concern I can see is that
> somebody malicious might create a document with an identical MD5
> checksum to one that they don't want you to find.  This seems a very
> artificial situation though.

Yeah. There are easier ways to stop you finding the document...

> So I think MD5 is probably an appropriate choice currently.

Probably. We mostly care about speed and size of output, rather than
strength.
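
As a rough sketch of the kind of checksum being discussed, the
following Python computes an MD5 digest of a file's contents using the
standard hashlib module. The helper name and chunk size are
assumptions made for illustration, not anything omindex does.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 65536) -> bytes:
    """Return the 16-byte MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()
```

The raw digest is always 16 bytes, which is the "size of output" that
matters when storing one checksum per document.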

> > > > 4) For files that require command line utility processing (i.e.
> > > > pdftotext) I have added a --copylocal option.
> > 
> > I'd like an option to turn it off, if we do include it. I'm not 100%
> > certain why I think this, though.
> 
> For the case where you're indexing from local disk already!  I suspect
> this is much more common than indexing from a network drive.

Oh yeah. D'oh!

Again, I'd like to see some figures showing that this helps for
network drives, and then do some tweaking to see whether the slowdown
can be mitigated or eliminated. In particular, I don't think it's worth
adding features to help out with doing this over NFS pre-v4. I suspect
the number of people wanting to do this over CIFS is limited (there
isn't a terribly good set of reasons for doing it), and other NAS
protocols probably aren't going to concern us too much. (Anyone got
experience with SAMFS?)

> > [filename in data field]
> > > As James says, we have a different approach to purging removed files
> > > during indexing which doesn't require this field.  I don't object
> > > strongly to adding this if it's actually useful though.
> > 
> > I think it has definite advantages. More generally, it's a source
> > identifier, which could be:
> > 
> >  * filename of source file
> >  * SQL database table primary key
> >  * Object database lookup key
> >  * URI of resource with metadata in RDF database
> 
> But omindex isn't this general - it indexes files forming a website.
> There's nothing to stop people who are indexing from other sources
> (whether with scriptindex or a custom indexer) adding a source
> identifier if they find it useful, but let's consider whether it's
> generally useful for omindex to do it rather than looking at other
> situations.

Okay, but if omindex added the file path as the source identifier, I
can see how that would be useful. In particular, if you (for some
reason) batch delete files, it's an awful lot quicker than using
omindex to reindex the entire system to get rid of them from Xapian.
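
A hedged sketch of such a batch delete, in Python. It assumes the
source identifier is also indexed as a unique boolean term (the
discussion above only mentions the data field), and it assumes a "Q"
prefix for that term. The `db` argument is expected to behave like
Xapian's WritableDatabase, whose delete_document() accepts a unique
term; the helper names are hypothetical.

```python
def source_term(path: str, prefix: str = "Q") -> str:
    """Build the (assumed) unique term for a source file path."""
    return prefix + path


def delete_by_source(db, paths, prefix: str = "Q") -> int:
    """Delete every document whose unique term matches one of `paths`.

    `db` only needs a delete_document(term) method, as provided by
    xapian.WritableDatabase. Returns the number of terms submitted.
    """
    count = 0
    for path in paths:
        db.delete_document(source_term(path, prefix))
        count += 1
    return count
```

Each deletion is a direct term lookup, so the cost scales with the
number of removed files rather than with the size of the tree being
indexed.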

> > It would be nice to *either* have a separate source type field, *or*
> > just agree that if you need it, you should probably always stuff
> > fully-qualified URIs in the field, so you can create your own
> > private-use URI schemes as needed.
> 
> If you're taking the "every URI is sacred" view, the source identifier
> can change while the URI doesn't and the document doesn't get reindexed.
> So it could be stale information anyway.

Yes, but that's your concern. I wouldn't advise using file paths for
source identifiers precisely for that reason, but many people will be
comfortable with that.
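
For illustration only, here is one way fully-qualified source
identifiers could be built in Python. The "x-sql" scheme and both
helper names are made up for this sketch; nothing here is an agreed
convention.

```python
from urllib.parse import quote


def file_source_id(path: str) -> str:
    """Turn an absolute POSIX file path into a file: URI."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    # quote() leaves "/" alone by default and percent-encodes the rest.
    return "file://" + quote(path)


def sql_source_id(table: str, key) -> str:
    """Build a private-use URI for a database row (hypothetical scheme)."""
    return "x-sql:" + quote(table, safe="") + "/" + quote(str(key), safe="")
```

With a scheme on every identifier, a single field can carry any of the
source types listed earlier without a separate source type field.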

> > > However, this feature is sometimes problematic itself - people type
> > > in capitalised words in queries without knowing about the feature
> > > and sometimes the results returned aren't great.
> > 
> > I think the problem here is more to do with that. Could we have an
> > option to lowercase the query string beforehand, just a CGI param you
> > can punt into omega?
> 
> You can already achieve the same end result rather less crudely by
> using $set{stem_all,true} in the query template.  If you want it
> conditional on a CGI parameter, just use:
> 
>   $if{$eq{$cgi{STEMALL},yes},$set{stem_all,true}}

Okay, that'll do. Something else for the "How do I...?" section of the
documentation...

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


