can crit-bit trees be suitable for Xapian storage?

Eric Wong e at 80x24.org
Mon Jul 26 10:21:05 BST 2021


Hey all,

I'm also curious if crit-bit trees can be suited as an on-disk
structure for Xapian: https://cr.yp.to/critbit.html

One of the most appealing properties to me is it doesn't require
balancing operations of typical trees, nor resize operations of
hash tables.  The other appealing property is it is very little
code (at least the in-memory versions).

I recently introduced an in-memory implementation into git.git
(now 'next' branch) for doing SHA abbreviation lookups on loose
objects.  It'll likely be in git.git 'master' and a release, soon.

Good properties of crit-bit trees:

+ O(key_length) lookups/insert/delete (one branch per-differing bit)

+ ability to filter and iterate on prefixes

+ no balancing operations, expected to reduce writes

+ low per-node overhead, compatible with "container_of" idiom

+ very little code (even I can understand it :)

+ the prefixes could also be suitable for managing free space;
  I think dlmalloc 2.8.x does something similar.

+ everything else mentioned in https://cr.yp.to/critbit.html

Downsides:

- iteration requires recursion, which is bound by
  key length; so arbitrarily long keys will hurt.

- relatively new and AFAIK unproven as an on-disk format.
  CPU cache lines are ~64-bytes; FS and SSD blocks can be some
  orders of magnitude larger so there'll still be more write
  amplification.

- affected by locality and fragmentation (but AFAIK all large
  data/memory structures do, and we have xapian-compact).

I haven't thought much about how transactions would work with
on-disk formats, though.


With glass (xapian 1.4.11 on Debian stable); I seemed to be
hitting a fair amount of write traffic doing mass updates with
3 big shards, ~80GB each.

It seems to be the same problem that affected recoll:
https://lists.xapian.org/pipermail/xapian-discuss/2019-February/009727.html
<23640.32170.703368.841021 at y.dockes.com>
(I'd have to setup on a different machine to profile)

I encountered my own problems having too many shards:
https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009823.html
<20200826064728.GA32239 at dcvr>

Which is why I have 3 big shards (4-cores) atm...
I don't have a lot of RAM so I commit fairly frequently;
and I'm also on SATA-2 which is a bottleneck.

Anyways, thanks for reading.



More information about the Xapian-devel mailing list