[Xapian-devel] term duplication among index tables
Peter Friend
octavian at corp.earthlink.net
Fri Nov 3 20:35:29 GMT 2006
On Nov 3, 2006, at 11:47 AM, richard at lemurconsulting.com wrote:
> On Fri, Nov 03, 2006 at 10:19:51AM -0800, Peter A. Friend wrote:
>> I was wondering if there is a performance or complexity reason for
>> not having a separate table mapping term strings to unique numbers,
>> which could then be used in the other tables. Is this something that
>> has been considered previously and discarded as unworkable, or do you
>> think it may be worth pursuing?
>
> The problem is that while getting the database size as small as
> possible is useful, our primary goal with Xapian (at least, with the
> current backends) is to make searches as fast as possible; and the
> speed is very much dependent on the number of disk reads. Use of a
> lexicon risks forcing searches to perform an extra disk read for each
> term in the query, to look up the term in the lexicon.

Trading space for more performance is certainly a reasonable
tradeoff. Since the backends are basically B+ trees, I figured that
the space saved by using a term ID might allow more of the index pages
for the other tables to be cached in memory (and possibly reduce disk
hits). How that compares with the existing compression schemes,
though, would probably be very sensitive to the types of documents
being indexed.
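
To make the trade-off a bit more concrete, here is a rough sketch in
Python. It is a toy model, not Xapian's actual on-disk format: the
names TermLexicon, varint_len and key_bytes are made up for
illustration, the number of tables is just a parameter, and it ignores
whatever compression the backends already apply. It only counts the
bytes spent on term keys with and without a lexicon, and the extra
lexicon lookup each query term would need.

```python
class TermLexicon:
    """Hypothetical table mapping term strings to small integer IDs."""

    def __init__(self):
        self._ids = {}
        self.lookups = 0  # number of lexicon reads performed by queries

    def add(self, term):
        # Assign the next free ID the first time a term is seen.
        return self._ids.setdefault(term, len(self._ids) + 1)

    def lookup(self, term):
        # Each query term costs one extra lookup (potentially a disk read).
        self.lookups += 1
        return self._ids.get(term)


def varint_len(n):
    """Bytes needed to store n as a base-128 variable-length integer."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def key_bytes(terms, tables, lexicon=None):
    """Total bytes spent on term keys across `tables` index tables.

    Without a lexicon, every table stores the full term string.
    With one, the string is stored once (plus its ID) in the lexicon,
    and every table stores only the variable-length ID.
    """
    if lexicon is None:
        return tables * sum(len(t.encode("utf-8")) for t in terms)
    total = 0
    for t in terms:
        tid = lexicon.add(t)
        total += len(t.encode("utf-8")) + varint_len(tid)  # lexicon entry
        total += tables * varint_len(tid)                  # per-table keys
    return total
```

Even in this simplified model the saving depends on how long the terms
are and how many tables repeat them, and every query term pays one
lexicon lookup before the other tables can be consulted.
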
If I can make the time to attempt such an overhaul, I'll share what I
find.
Cheers,
Peter