[Xapian-devel] term duplication among index tables

Peter Friend octavian at corp.earthlink.net
Fri Nov 3 20:35:29 GMT 2006


On Nov 3, 2006, at 11:47 AM, richard at lemurconsulting.com wrote:

> On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
>> I was wondering if there is a performance or complexity reason for not
>> having a separate table mapping term strings to unique numbers, which
>> could then be used in the other tables. Is this something that has
>> been considered previously and discarded as unworkable, or do you
>> think it may be worth pursuing?
>
> The problem is that while getting the database size as small as possible
> is useful, our primary goal with Xapian (at least, with the current
> backends) is to make searches as fast as possible; and the speed is very
> much dependent on the number of disk reads.  Use of a lexicon risks
> forcing searches to perform an extra disk read for each term in the
> query, to look up the term in the lexicon.

Trading space for speed is certainly reasonable. Since the backends are
basically B+ trees, I figured that the space saved by using term IDs
might allow more of the index pages for the other tables to stay cached
in memory, and so possibly reduce disk hits. How that compares with the
existing compression schemes, though, seems like it would be very
sensitive to the types of documents being indexed.
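
To make the tradeoff concrete, here is a toy sketch in Python. It is
only a model under stated assumptions: plain dicts stand in for on-disk
B-tree tables, every dict access is counted as one potential disk read,
term IDs are assumed to be 4 bytes wide, and key compression is ignored
entirely. None of the names (build_tables, search_direct,
search_with_lexicon, key_bytes) come from Xapian; they are hypothetical.

```python
# Toy model only: dicts stand in for on-disk tables, and none of these
# names correspond to Xapian's real table layout or API.

def build_tables(docs):
    """Build postings from {docid: [terms]} in both layouts.

    Returns (direct, lexicon, by_id):
      direct  -- term string -> list of docids (term is the key)
      lexicon -- term string -> small integer term ID
      by_id   -- term ID -> list of docids (ID is the key)
    """
    direct, lexicon, by_id = {}, {}, {}
    for docid in sorted(docs):
        for term in sorted(set(docs[docid])):
            direct.setdefault(term, []).append(docid)
            tid = lexicon.setdefault(term, len(lexicon))
            by_id.setdefault(tid, []).append(docid)
    return direct, lexicon, by_id


def search_direct(direct, query_terms):
    """One table lookup per query term."""
    lookups = 0
    results = {}
    for term in query_terms:
        lookups += 1
        results[term] = direct.get(term, [])
    return results, lookups


def search_with_lexicon(lexicon, by_id, query_terms):
    """Lexicon lookup first, then a postings lookup if the term exists."""
    lookups = 0
    results = {}
    for term in query_terms:
        lookups += 1  # extra lookup: term string -> term ID
        tid = lexicon.get(term)
        if tid is None:
            results[term] = []
            continue
        lookups += 1  # postings lookup keyed by term ID
        results[term] = by_id[tid]
    return results, lookups


def key_bytes(terms, tables, id_width=4):
    """Rough key-storage estimate, ignoring any compression.

    duplicated   -- each term string stored as a key in every table
    with_lexicon -- strings stored once in the lexicon alongside an ID,
                    plus one ID-sized key per term in every other table
    """
    term_bytes = sum(len(t.encode("utf-8")) for t in terms)
    duplicated = term_bytes * tables
    with_lexicon = term_bytes + len(terms) * id_width * (tables + 1)
    return duplicated, with_lexicon
```

In this model a two-term query costs 2 lookups against the direct
layout and 4 with the lexicon, which is the extra read Richard
describes. The space side depends heavily on term length: with three
tables, key_bytes(["apple", "banana", "cherry"], 3) gives (51, 65), so
the lexicon is actually larger, while
key_bytes(["internationalization"], 3) gives (60, 36). Real numbers
would also depend on how well the existing key compression already
handles repeated terms, which this sketch does not model.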

If I manage to find the time to attempt such an overhaul, I'll share
what I find.

Cheers,

Peter




