[Xapian-devel] Some Questions From the beginner of Xapian

Olly Betts olly at survex.com
Wed Sep 17 08:26:05 BST 2008

On Wed, Sep 17, 2008 at 06:13:40AM +0000, Dave Spencer wrote:
> It would be nice if there was some page on "concepts" that covered this


> I've wondered what the intent of get_data and set_data was, esp why have
> the indexed values (the index being the first arg to get/add value) whereas
> with data it's just a single value -- why not have multiple "data" values,
> or why not get rid of "data" and just let the get/add value calls cover it?

Use values if you need fast access during the match process itself (e.g.
for sorting, collapsing, etc).  Then Xapian knows to store the data such
that this can be done efficiently.  If you're sorting by date, Xapian
only needs the date information and doesn't want to have to fetch
extraneous data to get it.  This is why there are multiple value slots.
(The current implementation doesn't make the best use of this, but as it
happens I'm working on that at the moment.)
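
As a rough illustration of the "sorting by date" case, here is a minimal
sketch of building a sort key that is small and compares correctly as a
plain string.  It uses only the Python standard library; DATE_SLOT and
date_sort_key() are names made up for this example.  With the Xapian
bindings, the resulting string is the sort of thing you would pass to
Document.add_value(DATE_SLOT, key) and sort on with
Enquire.set_sort_by_value().

```python
from datetime import date

# Hypothetical slot number for this sketch; any otherwise unused slot would do.
DATE_SLOT = 0


def date_sort_key(d: date) -> str:
    """Encode a date as YYYYMMDD so that string order matches date order."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


# Made-up documents, each with the one piece of data the sort needs.
docs = {
    "reply": date(2008, 9, 17),
    "older": date(2007, 12, 31),
    "newer": date(2008, 10, 1),
}

keys = {name: date_sort_key(d) for name, d in docs.items()}

# Sorting the short key strings alone gives chronological order; nothing
# else about the documents has to be read to do it.
by_date = sorted(keys, key=keys.get)
print(by_date)
```

The point of the fixed-width encoding is that the value stays tiny and
needs no parsing during the match; anything only needed for display
stays out of the slot.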

Optimising the storage scheme for this use case will hurt other access
patterns, so we advise against storing arbitrary "data fields" in value
slots.  If you need to store other data which isn't needed in this way
(e.g. you want it for displaying results), serialise it into the
document data instead.
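
A minimal sketch of what "serialise it into the document data" can look
like, using only the Python standard library.  It packs a few display
fields into a single string in two ways: JSON, and a simple
one-field-per-line name=value layout.  The field names and the helper
functions are invented for this example, and the name=value rules here
are a simplification rather than a description of what any particular
tool writes.  With the Xapian bindings, the packed string is what you
would hand to Document.set_data() and read back with get_data().

```python
import json


def pack_json(fields: dict) -> str:
    """Serialise display fields to one JSON string."""
    return json.dumps(fields, sort_keys=True)


def unpack_json(data: str) -> dict:
    """Recover the fields from a JSON string."""
    return json.loads(data)


def pack_name_value(fields: dict) -> str:
    """Serialise string fields as one "name=value" line per field."""
    lines = []
    for name, value in fields.items():
        # This simple layout cannot represent these cases, so refuse them.
        if "=" in name or "\n" in name or "\n" in value:
            raise ValueError(f"cannot encode field {name!r}")
        lines.append(f"{name}={value}")
    return "\n".join(lines)


def unpack_name_value(data: str) -> dict:
    """Recover the fields; split on the first "=" of each line only."""
    fields = {}
    for line in data.split("\n"):
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


# Made-up fields that are only needed when displaying a result.
fields = {
    "url": "http://example.org/page?id=1",
    "title": "An example page",
}

as_json = pack_json(fields)
as_lines = pack_name_value(fields)
print(as_json)
print(as_lines)
```

Either string round-trips back to the same fields, so the choice is
mostly about size and about what the code displaying results finds
convenient to parse.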

There are already plenty of existing ways to serialise structured data
into a single string, so when we were originally building Xapian we just
chose a simple approach which lets you pick whichever existing solution
you like (some examples: XML, Python's pickle, JSON, Omega's
"name=value" scheme) and let us get on with the rest of the job.

At some point I think we probably will add support for some sort of
document fields.  How verbose the stored format is matters more here
than in most situations, so designing our own wouldn't just be a case of
reinventing the wheel, though we may be able to reuse an existing
solution anyway.

A numerically subscripted array of strings doesn't add much generality
though - if you want to store any other sort of structure or any
non-string data, you're still going to have to serialise it to one or
more strings.  I think we probably should aim higher.

There's a ticket tracking this issue:


> I'm guessing the intent of 'data' is to store some key piece of info
> about a document such as the URL of a doc that represents a web page.

One *or more* pieces of information, but otherwise yes.

