[Xapian-tickets] [Xapian] #822: Honey format tweaks

Xapian nobody at xapian.org
Wed Aug 16 04:16:59 BST 2023


#822: Honey format tweaks
----------------------------------+------------------------
        Reporter:  Olly Betts     |      Owner:  Olly Betts
            Type:  defect         |     Status:  new
        Priority:  normal         |  Milestone:  1.5.0
       Component:  Backend-Honey  |    Version:
        Severity:  normal         |   Keywords:
      Blocked By:                 |   Blocking:
Operating System:  All            |
----------------------------------+------------------------
 The encoding of spelling "tail" and "bookend" term lists could be
 improved.

 In honey, the spelling data encoding relies on the last 2 bytes (for
 tail) or the last 1 byte (for bookend) of each entry being fixed and
 recoverable from the key, but we still store a reuse byte for the first
 entry.  That reuse could cover up to two bytes, but it usually won't save
 any and it takes a byte to store, so overall it costs us slightly under
 one byte per tail and per bookend term list.

 The number of such lists is less than twice the number of spelling
 targets (typically significantly less, since many words share the same
 last two bytes, or the same first and last byte), so it's not a vast
 saving.  For example, the largest spelling data table I have to hand is
 from recoll: it has 494633 spelling targets but only 1617 bookends and
 1802 tails, so the saving there would be at most 3419 bytes.

 However, supporting this also complicates decoding, because it is
 possible for the reuse and the tail to overlap (we weren't handling this
 situation correctly until 99873ea22f22e8cb99d4f1db2d6591c2f725afa8), so
 we really should sort it out at some point.
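
 As a rough illustration of the cost described above, here is a minimal
 sketch of a toy prefix-compressed term list.  It is NOT the actual honey
 on-disk format: the layout (one reuse byte, one length byte, then the
 suffix bytes), the `encode` function and the `store_first_reuse` flag are
 all hypothetical, and the toy ignores the key-derived bytes entirely.  It
 only shows that dropping the first entry's reuse byte saves one byte per
 list, and applies that bound to the recoll counts quoted in the ticket.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode(words, store_first_reuse=True):
    """Toy encoding (hypothetical, not honey's): for each sorted entry emit
    [reuse byte][suffix length byte][suffix bytes].

    With store_first_reuse=False the reuse byte is omitted for the first
    entry only.  Assumes every word is shorter than 256 bytes.
    """
    out = bytearray()
    prev = b""
    for i, word in enumerate(sorted(words)):
        reuse = common_prefix_len(prev, word)
        suffix = word[reuse:]
        if i > 0 or store_first_reuse:
            out.append(reuse)
        out.append(len(suffix))
        out += suffix
        prev = word
    return bytes(out)


words = [b"cat", b"cart", b"care"]
saved_per_list = len(encode(words)) - len(encode(words, store_first_reuse=False))

# Counts quoted in the ticket for the recoll spelling table.
BOOKENDS = 1617
TAILS = 1802
upper_bound_saving = (BOOKENDS + TAILS) * saved_per_list

print(saved_per_list, upper_bound_saving)
```

 In this toy the first entry never has anything to reuse, so the saving is
 exactly one byte per list; the ticket's "slightly under one byte" reflects
 that in the real format the first entry's reuse occasionally does save
 something.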
-- 
Ticket URL: <https://trac.xapian.org/ticket/822>
Xapian <https://xapian.org/>