[Xapian-tickets] [Xapian] #423: Document termlist get_termfreq() method behaviour depends on whether terms are cached

Xapian nobody at xapian.org
Wed Mar 22 04:48:44 GMT 2023


#423: Document termlist get_termfreq() method behaviour depends on whether terms
are cached
-----------------------------+-------------------------------
 Reporter:  Richard Boulton  |             Owner:  Olly Betts
     Type:  defect           |            Status:  new
 Priority:  normal           |         Milestone:  1.5.0
Component:  Library API      |           Version:  SVN trunk
 Severity:  normal           |        Resolution:
 Keywords:                   |        Blocked By:
 Blocking:                   |  Operating System:  All
-----------------------------+-------------------------------
Changes (by Olly Betts):

 * milestone:  1.4.x => 1.5.0


Old description:

> The !TermIterator objects returned by Document.termlist_begin(), for a
> document obtained from a database, can sometimes be used to obtain the
> term frequency, and sometimes can't be.  It's unpredictable which
> behaviour is obtained, unless you know the details of the implementation
> of caching of terms in Documents.
>
> For example, the following python code uses this to obtain the frequency,
> and works fine:
>
> {{{
> import xapian
> db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
> doc=xapian.Document()
> doc.add_term('foo')
> db.add_document(doc)
> doc=db.get_document(1)
> t=doc.termlist()
> item=t.next()
> item.termfreq
> }}}
>
> However, the following code (with one added line) doesn't:
>
> {{{
> import xapian
> db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
> doc=xapian.Document()
> doc.add_term('foo')
> db.add_document(doc)
> doc=db.get_document(1)
> doc.termlist_count()   # Added line
> t=doc.termlist()
> item=t.next()
> item.termfreq
> }}}
>
> For me, this code raises: "!InvalidOperationError: Can't get term
> frequency from a document termlist which is not associated with a
> database."
>
> This behaviour is because the termlist_count() method causes the terms to
> be loaded into the document, and Document then uses a !MapTermList to
> return the term iterator.
>
> Not sure of the easiest way to fix this - we could make !MapTermList be
> able to keep a reference to a database, and pass off such requests to the
> database if set (or, better, subclass !MapTermList for documents which
> are connected to a database).

New description:

 The !TermIterator objects returned by Document.termlist_begin(), for a
 document obtained from a database, can sometimes be used to obtain the
 term frequency, and sometimes can't be.  It's unpredictable which
 behaviour is obtained, unless you know the details of the implementation
 of caching of terms in Documents.

 For example, the following python code uses this to obtain the frequency,
 and works fine:

 {{{
 import xapian

 db = xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
 doc = xapian.Document()
 doc.add_term('foo')
 db.add_document(doc)
 doc = db.get_document(1)   # Re-fetch the document from the database
 t = doc.termlist()
 item = next(t)
 print(item.termfreq)
 }}}

 However, the following code (with one added line) doesn't:

 {{{
 import xapian

 db = xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
 doc = xapian.Document()
 doc.add_term('foo')
 db.add_document(doc)
 doc = db.get_document(1)   # Re-fetch the document from the database
 doc.termlist_count()   # Added line
 t = doc.termlist()
 item = next(t)
 print(item.termfreq)
 }}}

 For me, this code raises: "!InvalidOperationError: Can't get term
 frequency from a document termlist which is not associated with a
 database."

 This behaviour is because the termlist_count() method causes the terms to
 be loaded into the document, and Document then uses a !MapTermList to
 return the term iterator.

 Not sure of the easiest way to fix this - we could make !MapTermList be
 able to keep a reference to a database, and pass off such requests to the
 database if set (or, better, subclass !MapTermList for documents which are
 connected to a database).

--
Comment:

 The original issue reported here was actually fixed a while ago by
 [747ba3ef354a2b0180b18f0d878e1add7697ae5b] which reimplemented
 `Document::Internal` and implemented `termlist_count()` differently
 (essentially in the way I suggested above).  That change wasn't backported
 so this will be fixed in 1.5.0.

 There's also the related issue that the termfreq reported here covers only
 the current subdatabase (shard), not the whole database.

 Replying to [comment:7 Olly Betts]:
 > Perhaps we should only support `TermIterator::get_termfreq()` for an
 > allterms iterator.

 I had a look at what that would involve.

 It needs extra work to make this fail in the document case without also
 making it fail for an iterator from `db.termlist_begin(docid)`.

 It also requires that `REPLY_TERMLIST` stop eagerly calling
 `get_termfreq()` for each entry, which is a speed-up if this information
 isn't wanted, but probably a significant slow-down if it is.  However, this
 change would break `get_eset()` with a remote shard.  Ideally that would do
 more work remotely, but I'm not sure how feasible that actually is when
 there are both local and remote shards.  It probably would need
 `get_eset()` to work more like the matcher, which would not be a bad thing
 but is a significant amount of development work.

 Overall I think it's best to resolve this part by documenting that the
 `TermIterator` here only knows about the shard.  This is reasonably
 consistent, since the `Document` object effectively only knows about the
 shard it is in: `get_docid()` reports the docid in the shard.  If you
 really want the termfreq from the full database then you can keep a
 reference to it and call `db.get_termfreq(*term_iterator)`.
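
 As a rough illustration of that workaround in the Python bindings, the
 sketch below looks up each of a document's terms in a database handle the
 caller has kept, rather than reading `item.termfreq` from the document's
 term list.  The helper name `full_database_termfreqs` is hypothetical and
 not part of Xapian; it assumes only that `doc.termlist()` yields items with
 a `term` attribute (as in the examples in the description) and that `db`
 provides `get_termfreq(term)`.

```python
def full_database_termfreqs(db, doc):
    """Map each term indexing `doc` to its term frequency in `db`.

    `db` is the database handle kept by the caller (for example the
    combined multi-shard database), so the counts come from `db` rather
    than from whichever shard the document was read from.
    """
    # Ask the database directly instead of relying on item.termfreq,
    # which only reflects the shard behind the document's term list.
    return {item.term: db.get_termfreq(item.term) for item in doc.termlist()}


# Intended use with the real bindings (assumes xapian is installed and
# that `db` is an already opened xapian.Database):
#     doc = db.get_document(1)
#     freqs = full_database_termfreqs(db, doc)
```

 This costs one extra lookup per term, so it only makes sense when the
 whole-database figure is actually needed.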
-- 
Ticket URL: <https://trac.xapian.org/ticket/423#comment:8>
Xapian <https://xapian.org/>