[Xapian-discuss] Clarification of values, data, fields, and prefixed terms

Deron Meranda deron.meranda at gmail.com
Thu Aug 30 22:31:23 BST 2007


I'm fairly new to Xapian, and one of the more confusing things to
understand is the different ways to attach metadata to documents.
There seem to be several:

 * values
 * data (which can then by convention be formatted into fields)
 * prefixed terms

I have not really seen a clear description of these in one place, nor
of why you would use one over another.  Here's my understanding;
please fill in the gaps or correct me...


Values are user-defined discrete strings (identified by a "slot"
number).  A document can have either zero or exactly one value for any
given slot number.  Xapian does not interpret the meaning of the value
string nor does it predefine any slots, but it does allow for
filtering queries based upon a simple lexicographical "range" of values
that matched documents should possess.
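
To make the "lexicographical range" point concrete, here is a minimal
sketch in plain Python.  It does not use the Xapian API at all; the
slot number, the document ids and the helper in_range() are made up
for illustration.  It models each document's values as a mapping from
slot number to string and compares them as strings, which is why
numbers need an encoding that sorts correctly as text.

```python
# Toy model only, not the Xapian API.
# Each document maps slot number -> value string; a slot holds at most
# one value, so one dict per document is a natural fit.
YEAR_SLOT = 0  # hypothetical slot number chosen by the application

docs = {
    1: {YEAR_SLOT: "1999"},
    2: {YEAR_SLOT: "2007"},
    3: {},                     # this document has no value in the slot
    4: {YEAR_SLOT: "850"},     # unpadded number
}

def in_range(doc_values, slot, low, high):
    """True if the document has a value in `slot` with low <= value <= high,
    compared as strings (lexicographically), not as numbers."""
    value = doc_values.get(slot)
    return value is not None and low <= value <= high

# Documents whose year value lies in the string range "1990".."2010".
matches = [docid for docid, values in sorted(docs.items())
           if in_range(values, YEAR_SLOT, "1990", "2010")]
```

Note that in_range({0: "850"}, 0, "100", "2010") is False even though
850 lies between 100 and 2010 numerically, whereas the zero-padded
"0850" does fall in the range "0100".."2010".  As far as I can tell,
Xapian's sortable_serialise() function exists to encode numbers so
that string order matches numeric order.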

Prefixed terms index documents just like ordinary terms/words and thus
are used in probabilistic searches, and can carry positional
information if desired.  Prefixed terms are really just a convention
(not part of the Xapian core) of prepending some letters to the front
of terms before they are put in the index.  This usually means that
even normal terms derived from the words in a document need to be
prefixed as well so everything remains unambiguous.  It is common for
prefixed terms to describe additional information about the document
(such as document id, URL, etc) other than just the actual words
appearing in the document text.  Prefixed terms essentially attach type
or semantic meaning to terms by the selection of the prefix letter(s).
Within the Omega application framework a number of prefixes are
standardized.  The core's QueryParser has limited support for prefixed
terms by mapping a "field name" to a prefix, but otherwise the core
does not distinguish a prefixed term from a non-prefixed one.
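
As a rough sketch of the prefix convention described above, again in
plain Python rather than the Xapian API: the mapping and both helper
functions are hypothetical.  The letters "A" (author) and "S"
(subject/title) are taken from Omega's conventions; in the core the
field-name mapping is registered with QueryParser::add_prefix().

```python
# Field name -> term prefix.  The choice of mapping belongs to the
# application; the index itself just stores the resulting strings.
FIELD_PREFIXES = {"author": "A", "title": "S"}

def make_terms(field, text):
    """Return index terms for `text`, prefixed according to `field`.
    Body text (field=None) gets no prefix and is lowercased, so it
    cannot collide with the capital-letter prefixes."""
    prefix = FIELD_PREFIXES[field] if field else ""
    return [prefix + word.lower() for word in text.split()]

def parse_query_term(token):
    """Turn a user token such as 'author:knuth' into an index term,
    roughly what a field-name-to-prefix mapping does at query time."""
    field, sep, word = token.partition(":")
    if sep and field in FIELD_PREFIXES:
        return FIELD_PREFIXES[field] + word.lower()
    return token.lower()
```

For example, make_terms("author", "Donald Knuth") gives
["Adonald", "Aknuth"], and parse_query_term("author:Knuth") gives
"Aknuth", so the query term lines up with what was indexed.  Keeping
prefixes in capitals and ordinary terms in lowercase is one way to
keep the two unambiguous.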

Prefixed terms can also be used when you want to index both a stemmed
word and the original unmodified word, while not inhibiting the
ability to do phrase (near) searching.
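
A small sketch of that idea, under stated assumptions: toy_stem() is a
hypothetical stand-in for a real stemmer, index_text() is invented for
illustration, and nothing here calls Xapian.  The "Z" prefix for
stemmed terms follows the convention I believe Xapian's own term
generator uses.  The unstemmed words keep their positions (so phrase
and near matching still work), while the stemmed forms are stored
under the prefix without positions.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_text(text):
    """Return (positional, stemmed): unstemmed words with their word
    positions, and 'Z'-prefixed stems with no positional information."""
    positional = {}   # term -> list of word positions
    stemmed = set()   # prefixed stemmed terms
    for pos, word in enumerate(text.lower().split(), start=1):
        positional.setdefault(word, []).append(pos)
        stemmed.add("Z" + toy_stem(word))
    return positional, stemmed

positional, stemmed = index_text("searching indexes searching")
```

Here positional records "searching" at positions 1 and 3 and
"indexes" at position 2, while stemmed holds "Zsearch" and "Zindex".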

Finally, document data is just an opaque bunch of data attached to the
document.  It cannot be used as part of a query (although
applications built on top of the core can use it for processing and
displaying the search results).  By a convention of the Omega
application (not the core), the data is formatted as a multiline text
supplement, where each line is like "field=value", and thus allows one
to define fields for capturing metadata on a document.
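
For illustration, a minimal sketch of reading that "field=value"
layout back out of a document's data string.  parse_fields() and the
sample field names are hypothetical; in a real application the string
would come from the document (e.g. via get_data()), and the exact
rules Omega applies may differ from this sketch.

```python
def parse_fields(data):
    """Parse 'field=value' lines into a dict.  Each line is split on
    the first '=' only, so values may themselves contain '='.  Lines
    without any '=' are skipped."""
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields

# Made-up document data in the one-field-per-line layout.
sample = "url=http://example.org/a?x=1\ncaption=Example page\nsample=Some text"
fields = parse_fields(sample)
```

With the sample above, fields["url"] is "http://example.org/a?x=1" and
fields["caption"] is "Example page".  Since the core treats the data
as opaque, all of this parsing lives in the application.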


Is my understanding essentially correct?  Also, why would one ever use
the data fields rather than values?

Deron Meranda