[Xapian-tickets] [Xapian] #763: Track unique term bounds for documents in the collection

Xapian nobody at xapian.org
Mon Jul 23 08:03:50 BST 2018


#763: Track unique term bounds for documents in the collection
-------------------------+---------------------------
 Reporter:  gp1308       |             Owner:  gp1308
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:
Component:  Library API  |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+---------------------------

Comment (by olly):

 The tracking should look a lot like the tracking for the document length
 bounds.  These are stored in the "version" file - i.e. `iamglass` for
 glass.  See `doclen_lbound` and `doclen_ubound` in `glass_version.cc`.

 Unfortunately that code checks that there's no undecoded data after it has
 decoded the stats we know about, so we can't just add new stats and have
 older versions ignore them.  In hindsight we should have omitted that
 check so we could add new stats.
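
 To illustrate the problem, here is a minimal standalone sketch of a strict
 decoder next to a lenient one.  It is not the code from `glass_version.cc`:
 the variable-length encoding, the `Stats` struct and the functions
 `encode_uint`, `decode_uint` and `unserialise_stats` are all hypothetical
 names invented for this example.

```cpp
#include <cstdint>
#include <string>

// Hypothetical variable-length encoding: 7 bits per byte, with the high bit
// set on every byte except the last one.
void encode_uint(std::string& out, uint64_t v) {
    while (v >= 0x80) {
        out += static_cast<char>((v & 0x7f) | 0x80);
        v >>= 7;
    }
    out += static_cast<char>(v);
}

// Decode one integer from [*p, end), advancing *p past it.
// Returns false if the data is truncated or the value is too long.
bool decode_uint(const char** p, const char* end, uint64_t* result) {
    uint64_t v = 0;
    int shift = 0;
    while (*p != end) {
        unsigned char ch = static_cast<unsigned char>(*(*p)++);
        v |= static_cast<uint64_t>(ch & 0x7f) << shift;
        if (!(ch & 0x80)) {
            *result = v;
            return true;
        }
        shift += 7;
        if (shift > 63) return false;
    }
    return false;
}

// Hypothetical subset of the stats held in the version file.
struct Stats {
    uint64_t doclen_lbound = 0;
    uint64_t doclen_ubound = 0;
};

// With strict == true, any bytes left over after the known stats are treated
// as an error, so a file written with extra stats is rejected.  With
// strict == false, trailing bytes are ignored, which would let a newer
// writer append stats without breaking this reader.
bool unserialise_stats(const std::string& data, bool strict, Stats* stats) {
    const char* p = data.data();
    const char* end = p + data.size();
    if (!decode_uint(&p, end, &stats->doclen_lbound)) return false;
    if (!decode_uint(&p, end, &stats->doclen_ubound)) return false;
    if (strict && p != end) return false;
    return true;
}
```

 In this sketch, appending a third value to the serialised data makes the
 strict reader fail while the lenient reader still recovers the two bounds
 it knows about.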

 So we probably won't implement this for glass.  I'd suggest just not
 worrying about it being an incompatible change for now - rather than
 merging this change for glass, we'd apply it for honey, which is the next
 generation backend but still in development.  But honey doesn't yet
 support updating databases - currently you have to compact a glass
 database to create a honey one.  So implementing this for honey without
 implementing it for glass means that the compacting code which converts
 from glass to honey needs to calculate these bounds as it loops over all
 the documents - probably as it does the termlist table.

 That code is in `backends/honey/honey_compact.cc`, line 1866 currently.
 That loop needs to count how many terms have a non-zero wdf to get the number
 of unique terms in each document, and then track lower and upper bounds on
 that as we work through the table (the lower bound should ignore 0, since
 such documents won't be involved in weighted queries).  And then store
 those in the `iamhoney` file.
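
 A minimal sketch of that bookkeeping, assuming each document's termlist
 entry has already been decoded into a list of wdf values.  The names
 `UniqueTermBounds`, `count_unique_terms` and `compute_bounds` are
 hypothetical, and the real loop in `honey_compact.cc` works on the encoded
 termlist table rather than on vectors like these.

```cpp
#include <cstdint>
#include <vector>

using termcount = uint32_t;

// Hypothetical running bounds on the number of unique terms per document.
struct UniqueTermBounds {
    termcount lbound = 0;  // 0 means "no document with any terms seen yet"
    termcount ubound = 0;

    void update(termcount unique_terms) {
        // Documents with no terms are ignored for the lower bound, and they
        // can't raise the upper bound either, so skip them entirely.
        if (unique_terms == 0) return;
        if (lbound == 0 || unique_terms < lbound) lbound = unique_terms;
        if (unique_terms > ubound) ubound = unique_terms;
    }
};

// Count the terms in one document which have a non-zero wdf.
termcount count_unique_terms(const std::vector<termcount>& wdfs) {
    termcount n = 0;
    for (termcount wdf : wdfs) {
        if (wdf != 0) ++n;
    }
    return n;
}

// Work through every document's termlist, tracking the bounds as we go.
UniqueTermBounds compute_bounds(
    const std::vector<std::vector<termcount>>& termlists) {
    UniqueTermBounds bounds;
    for (const auto& wdfs : termlists) {
        bounds.update(count_unique_terms(wdfs));
    }
    return bounds;
}
```

 The resulting `lbound` and `ubound` are the two values that would then need
 to be serialised alongside the existing stats.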

--
Ticket URL: <https://trac.xapian.org/ticket/763#comment:2>
Xapian <https://xapian.org/>