[Xapian-devel] term duplication among index tables

richard at lemurconsulting.com richard at lemurconsulting.com
Fri Nov 3 19:47:30 GMT 2006


On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
> I have been looking into using Xapian to provide search for email. I  
> have to be very careful about indexing overhead, and I notice that  
> several of the tables use term strings as part of the key or data. I  
> was wondering if there is a performance or complexity reason for not  
> having a separate table mapping term strings to unique numbers, which  
> could then be used in the other tables. Is this something that has  
> been considered previously and discarded as unworkable, or do you  
> think it may be worth pursuing?

It's a perfectly good question, and one which has been discussed several
times in the past (though maybe not on this list).  Indeed, IIRC, one of
the early backend implementations used a central lexicon of terms, and
referenced this lexicon using term IDs in the posting list, position list
and termlist tables.  This idea was discarded at a later date for
performance reasons.

The problem is that while getting the database size as small as possible is
useful, our primary goal with Xapian (at least, with the current backends)
is to make searches as fast as possible; and the speed is very much
dependent on the number of disk reads.  Use of a lexicon risks forcing
searches to perform an extra disk read for each term in the query, to look
up the term in the lexicon.

It is possible that the idea could still be useful for some tables, since
the lexicon should be relatively small, and could therefore get largely
cached in memory, but this would, of course, reduce the amount of space
available for caching other parts of the database.

I would be surprised if it could be used effectively for storing termlists,
since accessing such a list would require lots of lookups in the lexicon.
In any case, these are now rather well compressed (at least with the flint
backend).  It doesn't seem inconceivable that it could be useful for the
posting list or position list tables.

If you want to try any experiments, profiling results of an implementation
with term IDs would be quite interesting.

-- 
Richard



More information about the Xapian-devel mailing list