[Xapian-discuss] Re: Xapian query language

Thu Mar 30 20:35:44 BST 2006

On Thu, Mar 30, 2006 at 09:54:31AM -0800, Michel Pelletier wrote:
> """
> Get data stored in the document.
> 
> This is a potentially expensive operation, and shouldn't normally be 
> used in a match decider functor. Put data for use by match deciders in a 
> value instead.
> """
> 
> So values are also poor performance?  I feel like maybe there is a 
> terminology confusion here or something, James Aylett mentioned recently 
> that a lot of things were renamed, so hopefully we're bumping up against 
> that and values really aren't poor performance, for my sake. ;)

I don't think there is a terminology problem here, but don't panic!

The intention is that values are for holding auxiliary data for use during the
match process, and that the document data is used for holding data for use
after the match process.

The thinking is that a match process may consider many thousands of documents,
but only a few (often 10 or fewer) documents will actually be displayed as a
result of a match.  In a search which sorts on a value, or similar, this
corresponds to thousands of accesses to the values table, but only a few
accesses to the document data table.
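
To put rough numbers on that, here is a toy model of the access pattern.  It
is only a sketch: plain dicts stand in for the two tables, and run_match, the
5000-document collection and the page size of 10 are made up for illustration.
None of it is Xapian API.

```python
from collections import Counter

# Toy model only: plain dicts stand in for the values table and the
# document data table. Nothing here is Xapian API.

def run_match(candidates, values_table, data_table, page_size=10):
    """Sort every candidate on its value, then fetch data for one page only."""
    reads = Counter()

    def sort_key(docid):
        reads["values"] += 1  # one values-table read per candidate document
        return values_table[docid]

    ranked = sorted(candidates, key=sort_key)

    page = []
    for docid in ranked[:page_size]:
        reads["data"] += 1  # one document-data read per displayed result
        page.append(data_table[docid])
    return page, reads


# Hypothetical collection of 5000 documents, all considered by the match.
values_table = {docid: "%06d" % (5000 - docid) for docid in range(5000)}
data_table = {docid: "record %d" % docid for docid in range(5000)}

page, reads = run_match(range(5000), values_table, data_table)
print(reads["values"], reads["data"])  # prints "5000 10"
```

The values are read once per document considered, so their size is what
matters for IO during the match; the document data is read once per result
actually shown.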

Our aim is to keep the volume of IO to a minimum during the match, and
hopefully to cache most of the commonly read disk blocks.  For this reason,
information which is not useful during the search should not be stored in the
values, because this will cause more space to be taken up by the values table,
and a correspondingly greater likelihood that data has to be read from disk to
get the values for each document considered by the match process.

At present, the values are actually stored in a fairly similar way to the
document data.  However, this doesn't mean that they get accessed in a similar
pattern, or cached to a similar extent.  We'd hope that in a running system
which was making use of the values table, a good part of that table would be
cached at any given time.

So - the values aren't inherently "slow" to access - but you can make them so
by storing more data than needed in the values table.

In summary - only put information required during the match process in the
values.  Any data used after the match process (including data used for
post-processing result lists) should be placed in the document data.  If this
requires pickling a data structure, do that.  The CPU cost of doing so is
likely to be much less of an issue than the wait time for a disk block to be
read.
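
As a rough sketch of that split in Python: split_record, the field names and
the choice of slot 0 below are hypothetical, picked only for illustration.
The commented lines assume the Python bindings expose add_value, set_data and
get_data on Document as the C++ API does, and that the document data accepts
the bytes which pickle produces; check both against your version.

```python
import pickle

def split_record(record, match_slots):
    """Split a record into (values, data).

    match_slots maps field name -> value slot number for the fields the
    match process needs (for example a sort key). The whole record is
    pickled into the document data for use after the match.
    """
    values = {slot: str(record[field]) for field, slot in match_slots.items()}
    data = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
    return values, data


# Hypothetical record: only "date" is needed during the match (for sorting).
record = {
    "title": "Xapian query language",
    "date": "20060330",
    "body": "full message text",
}
values, data = split_record(record, {"date": 0})

# At index time, assuming the xapian Python bindings are installed:
#   doc = xapian.Document()
#   for slot, value in values.items():
#       doc.add_value(slot, value)   # small, read for every candidate
#   doc.set_data(data)               # large, read only for displayed results

# After the match, only for the handful of documents being displayed
# (in the bindings: pickle.loads(doc.get_data())):
restored = pickle.loads(data)
print(values, restored["title"])
```

Only the date ends up in the values table; the title and body stay in the
document data and are unpickled just for the results you display.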

-- 
Richard