[Xapian-discuss] Clarification of values, data, fields, and prefixed terms

Tue Sep 4 16:31:24 BST 2007

On 9/4/07, James Aylett <james-xapian at tartarus.org> wrote:
> On Tue, Sep 04, 2007 at 01:15:03AM -0400, Deron Meranda wrote:
>
> > But I don't understand the performance arguments.  Even if looking up
> > one value on a document means that all the values are retrieved,
> > how is that different from fields inside the data part?  Doesn't it also
> > have to retrieve all the fields from the data just to get to one of them
> > as well?
>
> Yes, but you aren't doing that in the match process. You might do it
> for 100 documents, not several million that are checked during a
> search - if you store data in the values, then every document
> considered by the matcher is going to have that data pulled out (if
> you use any values at all).
>
> Does that help?

Yes, that makes a lot more sense now.  If in some future
release values were made more efficient then the need for
data fields would mostly go away.  But until then the choice
between them, even if functionally equivalent, can affect
performance.

So from what I've learned, here are some rules of thumb that I've
deduced for dealing with document meta-data and deciding
between the apparently overlapping ways to store it...

If a piece of meta-data needs to be available for probabilistic
searching then employ some sort of term prefixing convention.
You can follow the Omega standardized term prefix rules, or
if you want you can make up your own prefixing rules.  If used,
term prefixing must be applied consistently to all terms, and not
just those representing the meta-data.
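
A minimal sketch of one such prefixing convention, in Python.  The
prefix table and the helper name prefixed_terms are illustrative
assumptions, not something taken from Omega or from this thread; the
strings it produces are what you would then hand to the document's
add_term() call.

```python
# Hypothetical prefix table: the letters are illustrative only.
# A multi-letter prefix starting with "X" stands in for a made-up,
# user-defined field.
PREFIXES = {"author": "A", "title": "S", "topic": "XTOPIC"}


def prefixed_terms(field, text):
    """Return lowercased terms for `text`, each carrying the prefix for `field`."""
    prefix = PREFIXES[field]
    # Lowercasing the words keeps the boundary between the upper-case
    # prefix and the term itself unambiguous.
    return [prefix + word for word in text.lower().split()]


author_terms = prefixed_terms("author", "Deron Meranda")
topic_terms = prefixed_terms("topic", "Search")
```

Because the prefix is upper-case and the term is lower-case, a term such
as "Ameranda" cannot be confused with an ordinary body-text word.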

If a piece of data will be most useful for partitioning (boolean
searching), for sorting match results, or for range-searching
then store it in a value.  Values should be "normalized" into
some format that preserves simple byte-by-byte lexicographical
sorting (such as zero-padding numbers).
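
To illustrate why the zero-padding matters, here is a small sketch in
Python.  The fixed width of 10 digits and the helper name encode_number
are assumptions made for the example; it only handles non-negative
integers, and the encoded string is what would be stored in a value
slot.

```python
WIDTH = 10  # assumed fixed width; must be large enough for every number stored


def encode_number(n, width=WIDTH):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    digits = str(n)
    if n < 0 or len(digits) > width:
        raise ValueError("number does not fit the fixed-width encoding")
    return digits.zfill(width)


sizes = [9, 100, 25]

# Plain decimal strings sort in the wrong order: "100" < "25" < "9".
unpadded = sorted(str(n) for n in sizes)

# Padded strings sort byte-by-byte in the same order as the numbers.
padded = sorted(encode_number(n) for n in sizes)
```

The padded list comes out in numeric order (9, 25, 100), while the
unpadded one puts 100 first.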

If a piece of data will not be used directly for doing searching
or producing a match set (although it could be used for the
*presentation* of the result set), then you should either:
   a) store it in some external database, or
   b) store it in the Xapian "data" area.
The former may be more appropriate where Xapian is
just part of a larger system.  For the latter you can optionally
format the data area into a set of field/value pairs as Omega
does, or you can put any kind of binary blob into it you wish.
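
For option b), a rough sketch of packing field/value pairs into a single
string for the data area, in Python.  It assumes a simple layout of one
"name=value" pair per line; the helper names and the sample field names
are made up for the example, and the packed string is what would be
passed to the document's set_data() call.

```python
def pack_fields(fields):
    """Serialise a dict of strings as one 'name=value' pair per line."""
    lines = []
    for name, value in fields.items():
        # Newlines would break the line-based layout, and "=" in a
        # name would make the split point ambiguous.
        if "\n" in name or "=" in name or "\n" in value:
            raise ValueError("field cannot be stored in this layout: %r" % name)
        lines.append(name + "=" + value)
    return "\n".join(lines)


def unpack_fields(data):
    """Parse the layout written by pack_fields back into a dict."""
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")  # split at the first "=" only
        if sep:
            fields[name] = value
    return fields


record = {"url": "http://example.org/?a=b", "caption": "A sample page"}
data = pack_fields(record)
restored = unpack_fields(data)
```

Since this parsing happens only for the handful of documents actually
displayed, its cost stays out of the match process.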

Also if a piece of meta-data needs to be easily extracted
from a document in the match set (such as a unique
document ID# or URN) then it should either be put into
a value or a data field--don't just store it as a prefixed term.

Does that more or less summarize the intended use of
the different mechanisms?
-- 
Deron Meranda