About xapian serialization on float/double variables
james at tartarus.org
Tue Jan 22 19:54:13 GMT 2019
The main thing to be aware of is that there are uses for sort keys (and indeed values in general) where the value isn't a number to start off with. For instance, a catalogue of paintings might have stored values of the title or the artist, both of which could be usefully sorted on.
Similarly, one of the examples in the getting started guide stores a date as a string in YYYYMMDD format. Although it would be possible to convert this into a number, that has some complexity in the general case (particularly around calendar changes). For most Western uses, YYYYMMDD is easy to calculate and to debug, and acceptable as both a sort key and for range queries.
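As a rough illustration of why YYYYMMDD works for both purposes, here is a minimal Python sketch. It doesn't touch Xapian at all; date_key and in_range are hypothetical helper names of mine, and the only thing it shows is that plain string comparison of zero-padded YYYYMMDD keys agrees with chronological order.

```python
from datetime import date


def date_key(d: date) -> str:
    """Format a date as a zero-padded YYYYMMDD string."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def in_range(key: str, start: str, end: str) -> bool:
    """Inclusive range test using ordinary string comparison."""
    return start <= key <= end


dates = [date(2019, 1, 22), date(1999, 12, 31), date(2019, 1, 3)]

# Sorting the string keys gives the same order as sorting the dates.
keys_sorted = sorted(date_key(d) for d in dates)
assert keys_sorted == [date_key(d) for d in sorted(dates)]

# A range query for January 2019 is just two string comparisons.
january = [k for k in keys_sorted if in_range(k, "20190101", "20190131")]
```

The zero padding matters: without it, "2019122" and "20190122" would no longer compare sensibly as strings.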
Just a slight point as well: you talk about "sort_key related fields", but they aren't fields in the way most people would use the word: from the database's point of view there are just values, which have some specific use cases (fields tend to be serialised into document data). Values only become sort_key related at query time (although you will probably have designed them for one or more of their intended uses).
When you say "sort_keys will be unserialized when user needs to read its real float/double values…", that's not really an anticipated way of working, because for display or further processing you'd usually pull things out of the document data at this point. (Values are designed to be fast to access during matching — and aren't necessarily performant in other situations.)
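To make the first point a bit more concrete: because values are just strings, a single string comparison can order titles, dates and numbers alike, provided the numbers are encoded so that byte order matches numeric order. Below is a minimal Python sketch of one well-known way to do that for IEEE-754 doubles. It is an illustration only, and I'm not claiming it is the byte format Xapian itself produces (in real code you'd use sortable_serialise() rather than rolling your own); encode_sortable and decode_sortable are hypothetical names.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = 0xFFFFFFFFFFFFFFFF


def encode_sortable(x: float) -> bytes:
    """Encode a double as 8 bytes whose byte-wise order matches numeric order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits & SIGN_BIT:
        # Negative: invert every bit so larger magnitudes sort earlier.
        bits ^= ALL_BITS
    else:
        # Non-negative: set the top bit so these sort after all negatives.
        bits |= SIGN_BIT
    return struct.pack(">Q", bits)


def decode_sortable(key: bytes) -> float:
    """Reverse encode_sortable()."""
    (bits,) = struct.unpack(">Q", key)
    if bits & SIGN_BIT:
        bits &= ~SIGN_BIT & ALL_BITS  # was non-negative: clear the top bit
    else:
        bits ^= ALL_BITS  # was negative: undo the inversion
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


numbers = [3.5, -2.0, 0.0, 1e10, -1e-3]

# Comparing the encoded byte strings orders the numbers correctly.
assert sorted(numbers, key=encode_sortable) == sorted(numbers)
```

With an encoding like this the matcher never has to decode anything while sorting; decoding is only needed if you actually want the number back.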
Hope that helps a little!
> On 22 Jan 2019, at 03:37, Miao LIU <miaoliu95 at acm.org> wrote:
> Dear Members of Xapian Project,
> Sorry for troubling you this time. It can be seen that Xapian stores Document values in serialized form when the value type is float/double.
> Such an approach is deployed on sort_key related fields as well, where Xapian requires that KeyMaker::operator() return a serialized float/double variable. Then heap sort comes and ranks the vector<MSetItem> items (multimatch.cc MultiMatch::get_mset()) by comparing serialized sort_keys (std::string) straightforwardly according to <IEEE-754 doubles>. Subsequently sort_keys will be unserialized when user needs to read its real float/double values during iterations of result MSet.
> Obviously, serialization and unserialization are time-consuming operations. Compared with defining and using sort_key as a float/double type directly, it is hard to see the benefits of this serialization in either performance or coding terms.
> It will be very kind of you if you could give a short illustration. Looking forward to your early reply.
> Best Regards,
> Miao LIU
devfort.com — spacelog.org — tartarus.org/james/