[Xapian-tickets] [Xapian] #819: What is the impact of block_size parameter in Database::compact method ?
Xapian
nobody at xapian.org
Wed Aug 3 22:30:46 BST 2022
#819: What is the impact of block_size parameter in Database::compact method ?
-------------------------+-------------------------------
Reporter: mgautier | Owner: Olly Betts
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Other | Version:
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-------------------------------
Comment (by Olly Betts):
It's a good question, but I don't think we really have a clear answer to
add to the documentation.
https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#databases
discusses this a bit:
> The .glass file actually stores the data, and is structured as a tree of
> blocks, which have a default size of 8KB (though this can be set, either
> through the Xapian API, or with some of the tools detailed later in this
> document).
>
> Changing the blocksize may have performance implications, but it is hard
> to know whether these will be positive or negative for a particular
> combination of hardware and software without doing some profiling.
There are a couple of points we could probably also mention there.
Making the blocksize a multiple of (or the same as) both the sector size
of the device and the blocksize of the filing system which the database is
on is almost certainly a good plan, but sector size seems to always be 4K
or less (https://en.wikipedia.org/wiki/Disk_sector) and FS block size
still seems to be 4K by default (the widely used ext4 potentially supports
up to 64K but only up to the system page size which is 4K on e.g. x86 and
x86-64). So it seems in practice this is typically not actually going to
be a consideration.
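As a rough illustration of the alignment point above, here is a minimal C++ sketch. The helper names are hypothetical and not part of the Xapian API. The accepted range (a power of 2 between 2048 and 65536) is what the API documentation gives for glass block sizes, and the sector and filesystem block sizes are example values you would need to look up for your own system.

```cpp
// Hypothetical helper (not Xapian API): glass block sizes are documented as
// a power of 2 between 2048 and 65536 inclusive.
bool is_valid_glass_block_size(unsigned block_size) {
    bool power_of_two =
        block_size != 0 && (block_size & (block_size - 1)) == 0;
    return power_of_two && block_size >= 2048 && block_size <= 65536;
}

// Hypothetical helper: true if block_size is a multiple of both the device
// sector size and the filesystem block size (all values in bytes).
bool is_aligned(unsigned block_size, unsigned sector_size,
                unsigned fs_block_size) {
    if (sector_size == 0 || fs_block_size == 0) return false;
    return block_size % sector_size == 0 && block_size % fs_block_size == 0;
}
```

With a 4096-byte sector and filesystem block, the default of 8192 passes both checks, while 2048 is a valid glass block size but is not a multiple of 4096.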
The main benefits a larger blocksize gives are slightly more efficient
packing plus reduced total per-block overheads (and the additional gains
here are likely to be smaller for each extra block size doubling), while
the downside is needing to read/write more data to read/write a single
block. The extra data is contiguous, at least in file offset terms (maybe
not always on disk), but there are potentially significant negative
factors like added pressure on the drive cache and OS file cache.
The additional losses are likely to grow for each extra block size
doubling.
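The shape of that trade-off can be sketched with some simple arithmetic in C++. This is only a toy model under stated assumptions: overhead_bytes is a hypothetical placeholder rather than glass's real per-block overhead, and the model ignores packing efficiency and cache effects entirely.

```cpp
#include <cstddef>

// Fraction of each block taken up by a fixed per-block overhead.
// overhead_bytes is a hypothetical placeholder, not a measured glass figure.
double overhead_fraction(double overhead_bytes, std::size_t block_size) {
    return overhead_bytes / static_cast<double>(block_size);
}

// Reduction in that fraction when going from block_size to 2 * block_size.
double gain_from_doubling(double overhead_bytes, std::size_t block_size) {
    return overhead_fraction(overhead_bytes, block_size) -
           overhead_fraction(overhead_bytes, 2 * block_size);
}

// Extra bytes transferred per single-block read or write after doubling.
std::size_t extra_io_from_doubling(std::size_t block_size) {
    return block_size;
}
```

In this model each doubling recovers half as much overhead as the previous one, while the extra data moved per block access doubles, which matches the direction of the argument above but says nothing about the actual magnitudes on a real system.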
In general for most people just using the default block size is sensible.
It's something you might tune when you either care more about reducing
size than anything else, or if you're prepared to profile your complete
system with different block sizes to see what works best for your own
situation.
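If you do want to profile, one approach is to produce a compacted copy of the database at each block size and benchmark your real workload against each. Here is a minimal C++ sketch that only generates the candidates and the corresponding command lines. The function names and the output naming scheme are hypothetical, and it assumes the xapian-compact tool's --blocksize option accepts a size in bytes, so check xapian-compact --help for your version.

```cpp
#include <string>
#include <vector>

// Candidate block sizes: every power of 2 from 2048 to 65536 inclusive.
std::vector<unsigned> candidate_block_sizes() {
    std::vector<unsigned> sizes;
    for (unsigned b = 2048; b <= 65536; b *= 2) sizes.push_back(b);
    return sizes;
}

// Builds an xapian-compact command line for one candidate. The destination
// name "<dest_prefix>-<block_size>" is a hypothetical convention, and no
// shell quoting is done, so this is only suitable for simple paths.
std::string compact_command(const std::string& src,
                            const std::string& dest_prefix,
                            unsigned block_size) {
    const std::string size = std::to_string(block_size);
    return "xapian-compact --blocksize=" + size + " " + src + " " +
           dest_prefix + "-" + size;
}
```

Comparing the on-disk size of each output is the easy part; the search and indexing timings need to come from your own queries on your own hardware.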
BTW, if you're creating a read-only database, using the single-file glass
format is worth considering. It's not going to save you disk space
(beyond saving a few inodes) but it means only one file needs to be opened
to open the database, which reduces initialisation overhead a little, and a
single file is more convenient if you need to copy it around. You can
even embed the database in another file so you can ship a single file
containing content and a Xapian database which provides a search of it.
--
Ticket URL: <https://trac.xapian.org/ticket/819#comment:1>
Xapian <https://xapian.org/>