[Xapian-tickets] [Xapian] #686: Backend support for >= 2**32 documents

Xapian nobody at xapian.org
Tue Nov 17 21:49:19 GMT 2015


#686: Backend support for >= 2**32 documents
---------------------------+------------------------------
 Reporter:  olly           |             Owner:  olly
     Type:  defect         |            Status:  new
 Priority:  normal         |         Milestone:  1.3.x
Component:  Backend-Glass  |           Version:  SVN trunk
 Severity:  normal         |        Resolution:
 Keywords:                 |        Blocked By:
 Blocking:                 |  Operating System:  All
---------------------------+------------------------------

Comment (by olly):

 I wrote a script to compare the encodings.

  * `now` is the current encoding, shown for comparison
  * `now3` is the current encoded extended to use a 3 bit length.
  * `unary` is my proposed new encoding
  * the others are as described in Managing Gigabytes 2nd ed, p117 (golbx
 being Golomb with b=x)

 The cell values are the number of bytes needed to encode a value with that
 many bits with the specified encoding.

 The shortest values per row are bold, and the longest italic (it happens
 every value is either the best or worst in its row).  For rows where all
 values are the same, bold and italic aren't used.  The `now` encoding is
 ignored when calculating the best and worst (since it can't represent the
 full 64-bit range).

 ||=bits  =||=now3
 =||=unary=||=gamma=||=delta=||=golb3=||=golb6=||=best=||=now=||
 ||1-13    ||2      ||2      ||2      ||2      ||2      ||2      ||2||(2)||
 ||14      ||''3''  ||'''2'''||'''2'''||'''2'''||'''2'''||''3''  ||2||(2)||
 ||15      ||''3''  ||'''2'''||'''2'''||'''2'''||''3''  ||''3''  ||2||(3)||
 ||16-20   ||3      ||3      ||3      ||3      ||3      ||3      ||3||(3)||
 ||21      ||'''3'''||'''3'''||'''3'''||''4''  ||'''3'''||'''3'''||3||(3)||
 ||22      ||''4''  ||'''3'''||''4''  ||''4''  ||''4''  ||''4''  ||3||(3)||
 ||23-28   ||4      ||4      ||4      ||4      ||4      ||4      ||4||(4)||
 ||29      ||'''4'''||'''4'''||'''4'''||''5''  ||'''4'''||''5''  ||4||(4)||
 ||30-35   ||5      ||5      ||5      ||5      ||5      ||5      ||5||(5)||
 ||36      ||'''5'''||'''5'''||''6''  ||''6''  ||'''5'''||'''5'''||5||(5)||
 ||37      ||'''5'''||''6''  ||''6''  ||''6''  ||'''5'''||''6''  ||5||(5)||
 ||38-43   ||6      ||6      ||6      ||6      ||6      ||6      ||6||(5)
 for 38 only||
 ||44      ||'''6'''||''7''  ||''7''  ||''7''  ||'''6'''||'''6'''||6||||
 ||45      ||'''6'''||''7''  ||''7''  ||''7''  ||''7''  ||''7''  ||6||||
 ||46-50   ||7      ||7      ||7      ||7      ||7      ||7      ||7||||
 ||51      ||'''7'''||''8''  ||'''7'''||'''7'''||'''7'''||'''7'''||7||||
 ||52      ||'''7'''||''8''  ||''8''  ||''8''  ||'''7'''||'''7'''||7||||
 ||53      ||'''7'''||''8''  ||''8''  ||''8''  ||''8''  ||''8''  ||7||||
 ||54-57   ||8      ||8      ||8      ||8      ||8      ||8      ||8||||
 ||58      ||'''8'''||''9''  ||'''8'''||'''8'''||'''8'''||'''8'''||8||||
 ||59      ||'''8'''||''9''  ||'''8'''||'''8'''||'''8'''||'''8'''||8||||
 ||60      ||'''8'''||''9''  ||''9''  ||''9''  ||'''8'''||'''8'''||8||||
 ||61      ||'''8'''||''9''  ||''9''  ||''9''  ||''9''  ||''9''  ||8||||
 ||62-64   ||9      ||9      ||9      ||9      ||9      ||9      ||9||||

 No one encoding is best everywhere, but unary seems as good as any of the
 other encodings in the table, and is the only one which is at least as
 good as the current encoding for all values currently encoded (i.e. it
 wouldn't add overhead to any database you can currently work with).

--
Ticket URL: <http://trac.xapian.org/ticket/686#comment:4>
Xapian <http://xapian.org/>
Xapian



More information about the Xapian-tickets mailing list