[Xapian-discuss] Clarification of values, data, fields, and prefixed terms

James Aylett james-xapian at tartarus.org
Sun Sep 2 13:46:11 BST 2007


On Thu, Aug 30, 2007 at 05:31:23PM -0400, Deron Meranda wrote:

> I'm fairly new to Xapian and one of the more confusing hurdles to
> understand is the different ways to attach meta-data to documents.  It
> seems like there are several different ways:
> 
>  * values
>  * data (which can then by convention be formatted into fields)
>  * prefixed terms

Each of these has a distinct use in Xapian. Two (values and prefixed
terms) provide different kinds of metadata that Xapian itself can
use; the other (data) is for application metadata that Xapian can
happily ignore.

> Values are user-defined discrete strings (identified by a "slot"
> number).  A document can have either zero or exactly one value for any
> given slot number.  Xapian does not interpret the meaning of the value
> string nor does it predefine any slots, but it does allow for
> filtering queries based upon a simple lexicographical "range" of values
> that matched documents should possess.

Values are there to be looked up during the match process. Collapsing
is done on a value, you can inspect values in a MatchDecider, and so
on. Range filtering is another example, as you point out.
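
To make the "lexicographical range" point concrete, here is a rough
sketch in plain Python. It does not use the Xapian API: the docs
mapping and the in_range() and pad() helpers are made up for
illustration. It only shows why value strings holding numbers need an
encoding whose string order agrees with numeric order before a range
filter behaves as expected.

```python
# Illustration only: plain Python, not the Xapian API.
# Each document maps slot numbers to value strings (at most one per slot).
docs = {
    1: {0: "9"},
    2: {0: "10"},
    3: {0: "25"},
    4: {},  # this document has no value in slot 0
}


def in_range(docs, slot, low, high):
    """Return ids of documents whose value in `slot` lies in [low, high],
    comparing the values as strings (lexicographically)."""
    return sorted(
        docid
        for docid, values in docs.items()
        if slot in values and low <= values[slot] <= high
    )


def pad(number, width=6):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    return str(number).zfill(width)


# Comparing raw decimal strings: "5" sorts after "30", so nothing matches.
naive = in_range(docs, 0, "5", "30")

# Re-encode every value with padding, then filter on padded bounds.
padded_docs = {
    docid: {slot: pad(int(value)) for slot, value in values.items()}
    for docid, values in docs.items()
}
padded = in_range(padded_docs, 0, pad(5), pad(30))
```

Zero-padding is only one possible encoding, chosen here because it is
easy to read; it assumes non-negative integers of bounded width.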

> Prefixed terms index documents just like ordinary terms/words and thus
> are used in probabilistic searches, and can carry positional
> information if desired.  Prefixed terms are really just a convention
> (not part of Xapian core) of prepending some letters to the front of
> terms before they are put in the index.

Right. As far as Xapian is concerned, you just have a bunch of
terms. How you create those terms, and your convention for term
construction, is an important part of your index plan. Prefixes are a
useful convention for reflecting document data/metadata structure in
the terms you generate.
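
A minimal sketch of such a term-construction convention, in plain
Python rather than the Xapian API. The PREFIXES map, the field names,
and both helper functions are assumptions made up for this example;
the only point is that a prefixed term is an ordinary string built by
your indexer, and that the same word indexed under different fields
yields different terms.

```python
import re

# Hypothetical prefix map: which letters to use is your own index plan.
PREFIXES = {"author": "A", "title": "S"}


def terms_for(field, text):
    """Lowercase the text, split it into words, and prepend the field's prefix.
    Fields with no entry in PREFIXES produce unprefixed terms."""
    prefix = PREFIXES.get(field, "")
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [prefix + word for word in words]


def index_terms(fields):
    """Collect the terms for every field of one document."""
    terms = []
    for field, text in fields.items():
        terms.extend(terms_for(field, text))
    return terms


terms = index_terms({
    "author": "James Aylett",
    "title": "Values and terms",
    "body": "terms are just strings",
})
```

Here "terms" from the title becomes "Sterms" while the same word in
the body stays "terms", so a search can target one or the other.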

> Finally, document Data is just an opaque bunch of data attached to the
> document.  It cannot be used as part of a query (although
> applications built on top of the core can use it for processing and
> displaying the search results).  By a convention of the Omega
> application (not the core), the data is formatted as a multiline text
> supplement, where each line is like "field=value", and thus allows one
> to define fields for capturing metadata on a document.

Yes. Data is just somewhere to shove stuff that Xapian doesn't have to
care about. This could be as simple as the id of the document
somewhere else, or it could contain summary metadata (or in theory the
entire document, although often that's not going to be a great idea).
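
Since Xapian treats the data as opaque, unpacking it is entirely the
application's job. As a rough sketch of the "field=value" per line
convention described above, in plain Python: parse_fields() and the
sample field names are hypothetical, and the handling of repeated
fields and of lines without "=" is a choice made for this example,
not something taken from Omega.

```python
def parse_fields(data):
    """Split document data into fields, one "field=value" pair per line.
    Returns a dict mapping each field name to a list of its values,
    so a field that appears on several lines keeps all of them."""
    fields = {}
    for line in data.splitlines():
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that contain no "=" at all
        fields.setdefault(name, []).append(value)
    return fields


# Sample data as an application might have stored it at index time.
data = "url=/docs/intro.html\ncaption=Introduction\nsample=a=b is not a query"
fields = parse_fields(data)
```

After a search, the application would fetch each matching document's
data string and run something like this over it to get at the fields
it wants to display.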

> Is my understanding essentially correct?  Also, why would one ever use
> the data fields rather than values?

I'm not certain that it is actually true right now, but in theory
you'll get better performance in some cases by using values as they're
intended (to be looked up and used during the match process), and data
as it's intended (to store additional metadata that Xapian doesn't
care about, for display/whatever in your application).

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


