[Xapian-devel] term duplication among index tables

Fri Nov 3 22:03:34 GMT 2006

On Fri, Nov 03, 2006 at 07:47:30PM +0000, richard at lemurconsulting.com wrote:
> On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
> > I was wondering if there is a performance or complexity reason for not  
> > having a separate table mapping term strings to unique numbers, which  
> > could then be used in the other tables. Is this something that has  
> > been considered previously and discarded as unworkable, or do you  
> > think it may be worth pursuing?
> 
> It's a perfectly good question, and one which has been discussed several
> times in the past (though maybe not on this list).

Actually, I'm not sure it's been discussed since the project became
Xapian, but we certainly talked about it at BrightStation.

> Indeed, IIRC, one of the early backend implementations used a central
> lexicon of terms, and referenced this lexicon using term IDs in the
> posting list, position list and termlist tables.

Yes, an early version of Quartz had a lexicon table.  I think the long
defunct "sleepycat" backend did too.

> This idea was discarded at a later date for performance reasons.

My recollection is that we originally gave up on this approach because it
greatly complicated handling multiple databases and query expansion.
But it turned out there were also performance arguments too.

> The problem is that while getting the database size as small as possible is
> useful, our primary goal with Xapian (at least, with the current backends)
> is to make searches as fast as possible; and the speed is very much
> dependent on the number of disk reads.  Use of a lexicon risks forcing
> searches to perform an extra disk read for each term in the query, to look
> up the term in the lexicon.

The speed of a search with nothing cached depends almost entirely on the
number of disk reads, though this is less true once the system has been
running a while (which is more important in many cases). because most of
the blocks will be cached.  But search times on a "cold" database matter
for some uses, and it's rare to have absolutely everything cached even
on a "hot" database.

> I would be surprised if it could be used effectively for storing termlists,
> since accessing such a list would require lots of lookups in the lexicon.
> In any case, these are now rather well compressed (at least with the flint
> backend).

And with quartz too.  The list of terms is compressed by removing any
prefix in common with the previous term, which probably gives
compression comparable to that you'd get from numeric termids.  I'd
imagine numeric termids are best compressed by storing them in sorted
order and using interpolative coding.

> It doesn't seem inconceivable that it could be useful for the
> posting list or position list tables.

It would help make the keys of these tables smaller, but I have plans to
implement a scheme to remove common prefixes from keys which should
address the same issue in a rather different way.

> If you want to try any experiments, profiling results of an implementation
> with term IDs would be quite interesting.

The current approach isn't with its problems.  For example, term length
is arbitrarily limited by the Btree key size.

I wonder if storing short terms in place and using numeric ids for long
terms would work well and give the best features from both approaches?
Though it would add complexity.

Cheers,
    Olly