[Xapian-tickets] [Xapian] #763: Track unique term bounds for documents in the collection

Xapian nobody at xapian.org
Sat Jul 28 02:06:24 BST 2018

#763: Track unique term bounds for documents in the collection
 Reporter:  gp1308       |             Owner:  gp1308
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:
Component:  Library API  |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All

Comment (by olly):

 Can you show a patch of the changes you're talking about in comment:7?

 Replying to [comment:8 gp1308]:
 > Also instead of using `current_wdf` of each term, `termlist_size` can be
 > used to update bounds for the number of unique terms?

 The correct count of unique terms ought to exclude those for which
 `wdf == 0`, so to get that we'd need to actually look at `current_wdf`;
 `termlist_size` will often be more than the correct value.
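
 To illustrate the difference, here is a minimal standalone sketch (not
 Xapian code). The `TermList` alias and both helper functions are
 hypothetical names invented for this example; it assumes a document's term
 list can be modelled as (term, wdf) pairs.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document's term list: (term, wdf) pairs.
using TermList = std::vector<std::pair<std::string, unsigned>>;

// Exact count: only terms with wdf > 0 contribute.
std::size_t exact_unique_terms(const TermList& terms) {
    std::size_t count = 0;
    for (const auto& entry : terms) {
        if (entry.second > 0) ++count;
    }
    return count;
}

// What counting by term list size alone reports: every entry,
// including those with wdf == 0.
std::size_t termlist_size(const TermList& terms) {
    return terms.size();
}
```

 For a document with terms `apple` (wdf 2), `Qid123` (wdf 0) and `pear`
 (wdf 1), the exact count is 2 while the term list size is 3.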

 However, currently for efficiency we approximate like this:

 Xapian::termcount
 GlassTermList::get_unique_terms() const
 {
     LOGCALL(DB, Xapian::termcount, "GlassTermList::get_unique_terms", NO_ARGS);
     // get_unique_terms() really ought to only count terms with wdf > 0, but
     // that's expensive to calculate on demand, so for now let's just ensure
     // unique_terms <= doclen.
     RETURN(min(termlist_size, doclen));
 }

 So the bound here needs to be based on the same thing, so that it's
 actually a bound on the value that `get_unique_terms()` can return.
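
 As a rough sketch of what that could look like, the bounds might be updated
 per document from the same `min(termlist_size, doclen)` approximation. The
 `UniqueTermBounds` struct and `add_document()` are hypothetical names for
 illustration only, not part of Xapian, and the real patch may well treat
 cases such as empty documents differently.

```cpp
#include <algorithm>
#include <limits>

// Hypothetical tracker for collection-wide bounds; not Xapian's API.
struct UniqueTermBounds {
    unsigned lower = std::numeric_limits<unsigned>::max();
    unsigned upper = 0;

    // Update the bounds using the same approximation that
    // get_unique_terms() returns, so they bound that value
    // rather than the exact count of terms with wdf > 0.
    void add_document(unsigned termlist_size, unsigned doclen) {
        unsigned approx = std::min(termlist_size, doclen);
        lower = std::min(lower, approx);
        upper = std::max(upper, approx);
    }
};
```

 Because the approximation is never smaller than the exact count, an upper
 bound tracked this way would remain valid if exact counts were stored
 later, though it may be looser than necessary.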

 At some point it's likely we'll start storing the number of unique terms
 in a similar way to how we store the document length.  That's probably not
 going to happen for glass now though, as it would be hard to start doing
 so compatibly.

Ticket URL: <https://trac.xapian.org/ticket/763#comment:9>
Xapian <https://xapian.org/>
