[Xapian-discuss] New to Xapian (coming from Lucene)

Olly Betts olly at survex.com
Fri Apr 13 18:46:28 BST 2007


On Fri, Apr 13, 2007 at 12:48:02PM -0400, Jeff Anderson wrote:
> Yeah, that's extremely clunky for my needs. I'd much rather work with
> an API that allows me to specify product_id, price, and weight as words.
> Not numbers. I know C++ has hashes. :)

You can just set the mapping in your source code - e.g. for C++:

const Xapian::valueno VALUE_PRODUCT_ID = 0;
const Xapian::valueno VALUE_PRICE = 1;
const Xapian::valueno VALUE_WEIGHT = 2;

The original reason for using numbered rather than named slots when
this feature was added (well over 5 years ago) was efficiency of storage
in the backend.  With named slots, you either store a mapping somewhere,
which needs versioning, which makes updates more complex and hence
slower (well, that was the thinking then - I now suspect this is
probably flawed, as the set of mappings will only grow a few times), or
you store the name with every use of the value, which adds up to a lot
of space for a large database.

Also, it's easy enough to name the slots at the application code level,
as above.
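
If you'd rather look slots up by name at run time (the "hashes"
approach), a minimal sketch of an application-level table might look
like the following.  The function name slot_for() and the typedef are
my own for illustration, not part of Xapian; the typedef stands in for
Xapian::valueno so the sketch compiles without the library, and the
numbers simply mirror the constants above.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in for Xapian::valueno so this sketch builds without Xapian.
typedef unsigned valueno;

// Hypothetical application-level lookup: slot name -> slot number.
// The names and numbers are illustrative and mirror the constants above.
inline valueno slot_for(const std::string& name) {
    static const std::unordered_map<std::string, valueno> slots = {
        {"product_id", 0},
        {"price", 1},
        {"weight", 2},
    };
    std::unordered_map<std::string, valueno>::const_iterator it =
        slots.find(name);
    if (it == slots.end())
        throw std::invalid_argument("unknown value slot: " + name);
    return it->second;
}
```

With Xapian itself you would then pass slot_for("price") wherever a
slot number is expected, e.g. as the first argument to
Xapian::Document::add_value().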

We're intending to move to a different way of storing values, which will
be more like a stream in docid order for each value number.  That would
make the cost of storing the value name "as is" much lower, because it
would only be needed once for each chunk in the stream.  I also think
that a "name->number" map would probably work OK as I said above.

So I've been thinking we should consider moving from "value number" to
"value name" - it wouldn't be too hard to overload the methods and
support both, at least for a transition period, and perhaps
indefinitely.
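
To make the overloading idea concrete, here is a rough sketch of how
numbered and named methods could sit side by side on top of a
"name->number" map.  Everything in it (SlotNames, SketchDocument, the
method names) is hypothetical and is not the Xapian API.  It also
ignores the question of named slots clashing with numbers the
application already uses, which a real design would have to settle.

```cpp
#include <map>
#include <string>
#include <utility>

typedef unsigned valueno;  // stand-in for Xapian::valueno

// Hypothetical "name->number" map shared by the documents in a database.
// Entries are only ever added, so it grows a few times and then settles.
class SlotNames {
    std::map<std::string, valueno> slots_;
  public:
    // Look up a name, allocating the next unused number on first sight.
    valueno number_for(const std::string& name) {
        std::map<std::string, valueno>::const_iterator it = slots_.find(name);
        if (it != slots_.end()) return it->second;
        valueno slot = static_cast<valueno>(slots_.size());
        slots_.insert(std::make_pair(name, slot));
        return slot;
    }
};

// Hypothetical document class, NOT the Xapian API.
class SketchDocument {
    SlotNames& names_;
    std::map<valueno, std::string> values_;
  public:
    explicit SketchDocument(SlotNames& names) : names_(names) {}

    // Existing style: the caller supplies the slot number.
    void set_value(valueno slot, const std::string& value) {
        values_[slot] = value;
    }
    // New style: the caller supplies a name, which is mapped to a number.
    void set_value(const std::string& name, const std::string& value) {
        set_value(names_.number_for(name), value);
    }

    // An unset slot reads back as an empty string.
    std::string get_value(valueno slot) const {
        std::map<valueno, std::string>::const_iterator it = values_.find(slot);
        return it == values_.end() ? std::string() : it->second;
    }
    // Note: in this sketch, reading an unknown name allocates a slot for it.
    std::string get_value(const std::string& name) const {
        return get_value(names_.number_for(name));
    }
};
```

Existing callers that pass a number keep working unchanged, while new
code can write doc.set_value("price", "9.99") and never see the number.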

>    $doc->set_value( url         => $product{url} );
>    $doc->set_value( title       => $product{title} );
>    $doc->set_value( isbn        => $product{isbn_10} );
> [...]
>        $hit->{title},
>        $hit->{url},
>        $hit->{description},
> [...]
> 
> I shouldn't be required to know that description is the third value,
> or title is the first.

Incidentally, you shouldn't use values as fields like this.  They're
designed specifically to hold small pieces of data which need to be
accessed quickly during the matching process, for purposes like sorting,
collapsing duplicate or similar matches together, or implementing
ranges over date, price, weight, etc.  So the strategy for storing them
assumes that speed of access is more important than compactness of
storage.

Fields you'll only want for displaying a result to the user should be
stored in the document data.  That's stored optimised for compactness,
on the assumption that you'll only want to retrieve a handful of
disparate entries at a time, e.g. to display a page of hits.

It only really matters for a large system, but most large systems
start as small systems.
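
Until there's a standard way to do it, one simple convention for
putting display fields in the document data is a "key=value" line per
field.  The sketch below shows that packing and unpacking using only
the standard library; pack_fields() and unpack_fields() are names I've
made up for illustration, and the format assumes that keys contain no
'=' or newline and that values contain no newline.

```cpp
#include <map>
#include <sstream>
#include <string>

// Pack display fields as "key=value" lines, one field per line.
// Assumes keys contain no '=' or newline and values contain no newline.
std::string pack_fields(const std::map<std::string, std::string>& fields) {
    std::string data;
    for (std::map<std::string, std::string>::const_iterator it = fields.begin();
         it != fields.end(); ++it) {
        data += it->first + "=" + it->second + "\n";
    }
    return data;
}

// Reverse of pack_fields(): split each line at its first '='.
std::map<std::string, std::string> unpack_fields(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::istringstream in(data);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip malformed lines
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

The packed string is what you'd hand to Xapian::Document::set_data()
when indexing, and unpack_fields() would be applied to the result of
get_data() when displaying a hit.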

> And all it takes is improving the API just a bit. Make new set_data()
> and get_data() methods that take optional keys.

Providing a standard (but optional) way to store key,value pairs in the
document data is definitely a planned feature - there's even a wishlist
bug for it:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=53

> I just don't see why the Xapian API wouldn't supply such. :(

Lack of infinite time!

Cheers,
    Olly


