[Xapian-devel] Proposed changes to omindex

Sat Aug 19 19:22:10 BST 2006

On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

I don't think this is generally useful, for reasons previously given:
omega/omindex are really targeted at indexing and searching web
sites, where the URI is the identifier. A filename used to provide a
representation of that resource isn't at all interesting to omega, and
is only partly interesting to omindex (i.e. there are other ways of
doing it). omindex is pretty limited in any case, and if you're doing
anything funky you'll be using scriptindex or your own indexer. Within
that, how you generate Q-terms and manage your documents is of course
entirely up to you.
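
To make the two Q-term strategies concrete, here is a minimal Python
sketch of what your own indexer might do. The function names and the
example URI are hypothetical, and this is not a claim about how
omindex itself builds its terms; it only contrasts "Q + URI" with
"Q + 16 byte MD5 of the file name" from point 1.

```python
import hashlib


def q_term_from_uri(uri: str) -> bytes:
    """Q-term keyed on the logical identifier (the URI)."""
    return b"Q" + uri.encode("utf-8")


def q_term_from_path_md5(path: str) -> bytes:
    """Q-term keyed on the raw 16 byte MD5 digest of the full file name."""
    return b"Q" + hashlib.md5(path.encode("utf-8")).digest()


if __name__ == "__main__":
    # The URI is made up for illustration; the path is the one quoted
    # later in this thread.
    print(q_term_from_uri("http://example.org/sample.html"))
    print(len(q_term_from_path_md5("/mnt/vol1/www/sample.html")))
```

The hashed form is a fixed length whatever the path, but it ties the
document to its physical location rather than to the URI.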

> 4) For files that require command line utility processing
> (i.e. pdftotext) I have added a --copylocal option.  This allows the
> file to be digested while being copied to the local drive and then
> the command line utility processes the local file saving multiple
> reads across the network. If we want to expand this it could be used
> to build a local cache/backup/repository.  For my use I was thinking
> of putting the files under source control (svn) but that is another
> discussion thread.

This is neat. I agree that for anything more complex it's not actually
going to solve all the requirements, but for remote files it can
work. (Although any decent network fs has built-in caching, and in any
case you could rely on the OS buffers: if you open() the file once,
dup() the file descriptor, and fdopen() each descriptor to get two
FILE*s, there's very little reason you'll have to hit the network
twice, even on a lame net fs. Do you have any timing data on how much
this improves things for you?)
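
A rough sketch of the open/dup/fdopen idea, written in Python rather
than C for brevity (os.open, os.dup and os.fdopen wrap the same
calls). The function name is hypothetical. One detail the sketch makes
explicit: dup'd descriptors share a file offset, so the second reader
has to seek back to the start. Whether the second pass really avoids
the network depends on the OS and the filesystem, so treat this as an
illustration, not a measurement.

```python
import hashlib
import os
import tempfile


def digest_then_reread(path):
    """Hash a file, then read it again through a dup'd descriptor."""
    fd = os.open(path, os.O_RDONLY)
    fd2 = os.dup(fd)  # second descriptor, same open file (shared offset)

    # First pass: digest the contents. Closing this file object closes
    # fd but leaves fd2 open.
    with os.fdopen(fd, "rb") as first:
        md5 = hashlib.md5(first.read()).hexdigest()

    # The shared offset is now at end of file, so rewind before the
    # second pass, which stands in for the external filter's read.
    os.lseek(fd2, 0, os.SEEK_SET)
    with os.fdopen(fd2, "rb") as second:
        data = second.read()
    return md5, data


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"sample document body")
    try:
        print(digest_then_reread(tmp.name))
    finally:
        os.remove(tmp.name)
```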

> 5) I would also recommend storing the full filename in the document
> data.  file=/mnt/vol1/www/sample.html.  I have a purge utility that
> cleans out documents that are no longer found on the file system
> using this information.  FYI: I am currently migrating to a MySQL
> metadata repository that will move information like this out of the
> search index; it also preserves metadata on complete index rebuilds
> and allows users to add additional information that may not be
> contained in the actual document.

omindex has its own mechanism for purging documents that no longer
exist. Again, the separation of logical URI from physical storage
pushes me in the direction of not wanting this in omindex.

One idea I've talked to someone about is separating omindex into
something that drives scriptindex, which in theory would allow you to
use the file spider in omindex with whatever indexing strategy you
wanted.

Speaking of metadata, what I'd really like is a Xapian-indexable RDF
store. I doubt anyone else wants one of those though :-)

> 8) Change the way stemmed terms are added to the database.  Remove
> the R prefix from raw terms and only write stemmed terms to the DB
> if they differ from the original term, prefixing them with Z?. If
> stemming was set to none this would reduce the current term tables
> (termlist, postlist, and position) by about 50%. The query parser
> would have to be modified to use the same rules.

Currently, you only get dual terms if the initial letter is a
capital. On a sample database I have here of an old blog, I have:

24535 terms in total
8157 R-terms
1718 other prefixed terms

So we'd get a saving of about 33% (8157 of 24535 terms) by dropping
R-terms when stemming; however, we'd then lose much if not all of that
saving (which I can't calculate without passing over the original data
again) by having to put stemmed versions back in again, whether an
R-term would have been generated or not. Mind you, a *very* quick test
suggests that on some of my data, no more than 25% of words actually
stem to something different. I suspect this is because there are lots
of short words in everyday English. So there could be some saving
here.
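
The arithmetic behind those figures, as a small Python sketch. The
term counts are the ones quoted above. The net_saving() model is an
assumption for illustration only: it pretends Z-terms get added for a
fixed fraction of the unprefixed terms, which is cruder than a real
pass over the data, so its output is a ballpark and not a result.

```python
TOTAL_TERMS = 24535      # all terms in the sample database
R_TERMS = 8157           # R-prefixed (raw) terms
OTHER_PREFIXED = 1718    # other prefixed terms


def r_term_share(total=TOTAL_TERMS, r_terms=R_TERMS):
    """Fraction of all terms that are R-terms."""
    return r_terms / total


def net_saving(restem_fraction, total=TOTAL_TERMS, r_terms=R_TERMS,
               other=OTHER_PREFIXED):
    """Assumed model: drop every R-term, then add one Z-term for
    restem_fraction of the unprefixed terms."""
    unprefixed = total - r_terms - other
    z_terms = restem_fraction * unprefixed
    return (r_terms - z_terms) / total


if __name__ == "__main__":
    print(f"R-term share: {r_term_share():.1%}")
    print(f"Net saving if 25% restem: {net_saving(0.25):.1%}")
```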

If you're not using stemming, and are content to force everything into
lowercase (modulo the excitement that causes with Unicode), dropping
R-terms seems a good strategy. I'd certainly favour having a way of
running the query parser that didn't need R-terms, and then perhaps a
way of driving omindex/scriptindex to not generate them in the first
place. It's a pretty easy change, in index_text.cc:index_text().
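
For illustration, the term generation rule from point 8 (raw term
unprefixed, stemmed term added with a Z prefix only when it differs)
could be sketched like this in Python. This is not the code in
index_text.cc, and toy_stem() is a deliberately crude stand-in for a
real stemmer; both function names are hypothetical.

```python
def toy_stem(word):
    """Crude suffix stripper, for demonstration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def terms_for_word(word, stem=toy_stem):
    """Return the index terms for one word: the lowercased raw form,
    plus a Z-prefixed stem if stemming changes the word."""
    raw = word.lower()
    terms = [raw]
    stemmed = stem(raw)
    if stemmed != raw:
        terms.append("Z" + stemmed)
    return terms


if __name__ == "__main__":
    for w in ("Indexing", "the", "is"):
        print(w, terms_for_word(w))
```

Passing a stemmer that returns its input unchanged gives exactly one
term per word, which is the "stemming set to none" case.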

I think this all comes down to whether you think stemming is a good
default or not. If you're more concerned about stemmed forms, you want
them to be obvious and probably unprefixed. (It's certainly easier to
debug this way.)

> Let me know if you are interested in including any of these changes
> in Xapian.

I think the best thing is to wait until Olly's back and has a chance
to digest all these and comment on them himself. It's really up to him
what goes in anyway :-)

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org