[Xapian-tickets] [Xapian] #819: What is the impact of block_size parameter in Database::compact method ?

Xapian nobody at xapian.org
Wed Aug 3 22:30:46 BST 2022


#819: What is the impact of block_size parameter in Database::compact method ?
-------------------------+-------------------------------
 Reporter:  mgautier     |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:
Component:  Other        |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Comment (by Olly Betts):

 It's a good question, but I don't think we really have a clear answer to
 add to the documentation.

 https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#databases
 discusses this a bit:

 > The .glass file actually stores the data, and is structured as a tree
 > of blocks, which have a default size of 8KB (though this can be set,
 > either through the Xapian API, or with some of the tools detailed later
 > in this document).
 >
 > Changing the blocksize may have performance implications, but it is
 > hard to know whether these will be positive or negative for a
 > particular combination of hardware and software without doing some
 > profiling.

 There are a couple of points we could probably also mention there.

 Making the blocksize the same as, or a multiple of, both the sector size
 of the device and the block size of the filing system which the database
 is on is almost certainly a good plan.  However, sector size seems to
 always be 4K or less (https://en.wikipedia.org/wiki/Disk_sector) and FS
 block size still seems to be 4K by default (the widely used ext4
 potentially supports up to 64K, but only up to the system page size, which
 is 4K on e.g. x86 and x86-64).  The default 8K blocksize already
 satisfies this, so in practice it is typically not going to be a
 consideration.
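
 As a rough illustration of that alignment check, here is a minimal Python
 sketch.  It is not part of Xapian: the helper names are made up for this
 example, the 4096-byte sector size is an assumption you'd replace with the
 real value for your device, and os.statvfs is only available on Unix-like
 systems (on some platforms f_bsize is a preferred I/O size rather than the
 allocation unit, so treat the result as a hint).

```python
import os


def is_aligned(block_size, sector_size, fs_block_size):
    """Return True if block_size is a whole multiple of both other sizes."""
    return block_size % sector_size == 0 and block_size % fs_block_size == 0


def reported_fs_block_size(path):
    """Block size the OS reports for the filing system holding path.

    Uses os.statvfs, so this only works on Unix-like systems.
    """
    return os.statvfs(path).f_bsize


if __name__ == "__main__":
    # Assumed sector size for illustration; query your device for the
    # real value.
    assumed_sector_size = 4096
    fs_bs = reported_fs_block_size(".")
    for candidate in (2048, 4096, 8192, 16384, 32768, 65536):
        print(candidate, is_aligned(candidate, assumed_sector_size, fs_bs))
```

 With a 4K sector size and a 4K FS block size, every candidate from 4K
 upwards passes, which is why this rarely ends up mattering.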

 The main benefits a larger blocksize gives are slightly more efficient
 packing plus reduced total per-block overheads (and the additional gains
 here are likely to be smaller for each extra doubling of the block size),
 while the downside is needing to read/write more data to read/write a
 single block.  The extra data is at least contiguous (in file offset terms
 anyway - maybe not always on disk), but there are potentially significant
 negative factors like added pressure on the drive cache and OS file cache.
 These additional losses are likely to grow with each extra doubling of the
 block size.
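
 To make the shape of that trade-off concrete, here is a toy Python model.
 It is only a sketch: the 16-byte per-block overhead is a made-up figure,
 not glass's actual block header size, and the model ignores packing
 efficiency and caching effects entirely.  It just shows that a fixed
 per-block overhead yields a saving which halves with each doubling, while
 the extra data moved per block access doubles.

```python
def doubling_table(start=2048, stop=65536, overhead_bytes=16):
    """Model the effect of each block size doubling from start to stop.

    overhead_bytes is a hypothetical fixed per-block overhead, chosen
    purely for illustration.

    Returns a list of tuples:
        (old_size, new_size, overhead_fraction_saved, extra_bytes_per_io)
    """
    rows = []
    size = start
    while size * 2 <= stop:
        new_size = size * 2
        # Fraction of each block spent on the fixed overhead, before/after.
        saved = overhead_bytes / size - overhead_bytes / new_size
        # Extra bytes transferred to read or write one block.
        extra_io = new_size - size
        rows.append((size, new_size, saved, extra_io))
        size = new_size
    return rows


if __name__ == "__main__":
    for old, new, saved, extra_io in doubling_table():
        print(f"{old:>6} -> {new:>6}: overhead saved {saved:.5%}, "
              f"extra bytes per block I/O {extra_io}")
```

 Real behaviour depends on your data and hardware, so this only motivates
 the direction of the effects, not their size.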

 In general, for most people just using the default block size is sensible.
 It's something you might tune either when you care more about reducing
 size than about anything else, or when you're prepared to profile your
 complete system with different block sizes to see what works best for
 your own situation.
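
 If size is what you care about, one simple way to compare is to compact
 the same database at several block sizes (e.g. via the block_size
 parameter of Database::compact) and measure what ends up on disk.  The
 Python sketch below only does the measuring; the function names and the
 paths in the usage comment are hypothetical, and it tells you nothing
 about search or indexing speed, which still needs profiling against your
 real workload.

```python
import os


def database_size(path):
    """Total size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compare_sizes(paths_by_block_size):
    """Map each block size to its on-disk size, smallest block size first."""
    return {
        block_size: database_size(path)
        for block_size, path in sorted(paths_by_block_size.items())
    }


# Example usage (hypothetical paths to already-compacted databases):
#   compare_sizes({4096: "db-4k", 8192: "db-8k", 65536: "db-64k"})
```

 Accepting either a file or a directory means the same helper works for
 single-file databases as well as the usual directory layout.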

 BTW, if you're creating a read-only database, using the single-file glass
 format is worth considering.  It's not going to save you disk space
 (beyond saving a few inodes), but it means only one file needs to be
 opened to open the database, which reduces initialisation overhead a
 little, and a single file is more convenient if you need to copy it
 around.  You can
 even embed the database in another file so you can ship a single file
 containing content and a Xapian database which provides a search of it.
-- 
Ticket URL: <https://trac.xapian.org/ticket/819#comment:1>
Xapian <https://xapian.org/>