[Xapian-tickets] [Xapian] #442: Add support for mapping field names to differing term prefixes across multiple databases

Xapian nobody at xapian.org
Thu Feb 4 15:21:55 GMT 2010


#442: Add support for mapping field names to differing term prefixes across
multiple databases
-------------------------+--------------------------------------------------
 Reporter:  richard      |       Owner:  olly     
     Type:  enhancement  |      Status:  new      
 Priority:  normal       |   Milestone:  1.2.x    
Component:  Library API  |     Version:  SVN trunk
 Severity:  normal       |    Keywords:           
Blockedby:               |    Platform:  All      
 Blocking:               |  
-------------------------+--------------------------------------------------
 = Overview =

 If multiple databases use differing schemes for mapping field names to
 term prefixes, it's currently not possible to form a search across them.

 I'd like to be able to perform a mapping from field name to the correct
 prefix for a sub-database at the point at which the posting lists for a
 query are opened, instead of when building the query, in order to allow me
 to do such searches.

 = Rationale =

 When implementing higher-level abstractions on top of Xapian (e.g. Xappy),
 it is very common to want to introduce the concept of data split into
 fields.  The current recommended way to do this is to use a different
 prefix for each field.

 One problem with this approach is that, if the abstraction is to hide the
 implementation from users, it is necessary to allocate prefixes
 automatically.  There are two approaches: either use the field name as the
 prefix, with some appropriate escaping mechanism, or generate and store a
 mapping from fieldname to prefix (either the first time that a fieldname
 is used, or in a preliminary "schema" generation step).

 Using the field name as the prefix currently causes significant database
 bloat, particularly for long field names.  One benchmark: on an example
 large database (containing 47 fields with descriptive field names
 averaging 9.2 characters in length), moving from Xappy's allocated prefixes
 to the full field names increases the postlist table size by 3.0%, the
 position table size by 7.6% and the termlist size by 10.7% (with chert).

 For Xappy, we instead generate a mapping from field name to a short
 (usually 2 character) prefix, and store the mapping in metadata keys.
 This seems to work well, but has a major drawback: it is no longer
 possible to search across multiple databases unless they have an identical
 mapping for all the fields involved in a search.
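
 The drawback can be illustrated with a minimal sketch.  The allocation
 scheme here ("X" plus one letter, in order of first use) and the names
 add_field, db1 and db2 are assumptions made for illustration, not Xappy's
 actual scheme; the dicts stand in for the mapping each database stores in
 its metadata.

```python
import string


def add_field(mapping, fieldname):
    """Return the prefix for fieldname, allocating the next unused one if needed."""
    if fieldname not in mapping:
        # Hypothetical scheme: 'X' plus one uppercase letter, in order of
        # first use (so this sketch supports at most 26 fields).
        mapping[fieldname] = "X" + string.ascii_uppercase[len(mapping)]
    return mapping[fieldname]


# Two databases that happened to see the same fields in a different order.
db1 = {}
db2 = {}
for name in ("title", "author"):
    add_field(db1, name)
for name in ("author", "title"):
    add_field(db2, name)

# The same (field, word) pair is stored under a different term in each
# database, so no single query term matches it in both.
term1 = db1["title"] + "xapian"
term2 = db2["title"] + "xapian"
```

 A query built against db1's mapping would look for term1 in both
 databases, and would silently match the "author" field in db2.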

 One solution to this problem may be provided by future database backends;
 it's possible that avoiding storing common prefixes in btree blocks will
 avoid the problem sufficiently.  However, at least one full fieldname
 would still need to be stored in each btree block, and I'm not convinced
 that this would remove the problem fully.

 Another approach would be to rework the Xapian API so that in all places
 where a term is supplied to the API, a (fieldname, term) pair is supplied
 instead, and have Xapian perform all the mapping from fieldnames to
 prefixes fully internally.  It might be possible to do this in a
 reasonably backwards compatible manner, by making the current versions of
 the methods store the terms in a field with name "".  The hard bit to do
 in a backwards compatible manner would be working out what to return from
 the APIs which currently return terms as single strings.

 = Suggested solution =

 I think it's undesirable to specify a fixed scheme for performing a
 mapping from fieldnames to prefixes, since it won't be appropriate for all
 users of Xapian (even the concept of "fields" isn't always appropriate).

 Instead, I suggest adding the ability to register a functor to be called
 when generating a leaf posting list from a leaf query.  This functor would
 be passed the database and the term from the query, and would return a
 term.  It would be able to use metadata stored in the sub-database to
 convert the term appropriately for that sub-database.
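
 As a rough sketch of the idea (not a proposed API), the hook could behave
 as follows.  FakeDatabase, field_mapper and open_postlists are hypothetical
 names, the "field:value" term format is only a placeholder, and nothing
 here is an existing Xapian call.

```python
class FakeDatabase:
    """Stand-in for one sub-database (hypothetical; not a Xapian class)."""

    def __init__(self, prefixes, postings):
        self.prefixes = prefixes  # field name -> term prefix, as kept in metadata
        self.postings = postings  # stored term -> list of document ids


def field_mapper(db, term):
    """Example functor: rewrite 'field:value' into this database's prefixed term."""
    fieldname, sep, value = term.partition(":")
    if sep and fieldname in db.prefixes:
        return db.prefixes[fieldname] + value
    return term  # not a field term for this database: leave it unchanged


def open_postlists(databases, term, mapper):
    """Resolve the leaf term per sub-database when its posting list is opened."""
    matches = []
    for index, db in enumerate(databases):
        stored_term = mapper(db, term)
        for docid in db.postings.get(stored_term, []):
            matches.append((index, docid))
    return matches


# The two databases use different prefixes for the same field.
db_a = FakeDatabase({"title": "XA"}, {"XAxapian": [1, 4]})
db_b = FakeDatabase({"title": "XB"}, {"XBxapian": [2]})
matches = open_postlists([db_a, db_b], "title:xapian", field_mapper)
```

 Because the mapping is applied per sub-database rather than when the query
 is built, one query term finds the documents in both databases.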

 A default functor could be defined which recognised a specific format for
 terms (possibly a 0-byte separating the fieldname from the value for that
 field, to allow any other character in fieldnames), and looked up the
 fieldname.  The default functor could use a standard set of metadata keys
 to look up fieldnames, e.g. "_F<fieldname>".  (This would introduce the
 convention that a _ prefix marks "internal" metadata for Xapian, with "F"
 (for field) leaving room to store additional
 internally-used-but-publicly-visible metadata in future if desired.)
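
 A minimal sketch of such a default functor, assuming the zero-byte
 separator and the "_F<fieldname>" keys described above.  The function name
 default_field_mapper is hypothetical, a plain dict stands in for the
 sub-database's metadata, and the handling of unknown fields is an
 assumption since it is not decided here.

```python
def default_field_mapper(metadata, term):
    """Map 'fieldname<NUL>value' to '<prefix>value' via '_F<fieldname>' keys."""
    fieldname, sep, value = term.partition("\0")
    if not sep:
        # No zero byte: not a field term, so pass it through unchanged.
        return term
    # Assumption: an unknown field is an error (raises KeyError).
    prefix = metadata["_F" + fieldname]
    return prefix + value


# Two sub-databases storing different prefixes for the same field.
meta_a = {"_Ftitle": "XA"}
meta_b = {"_Ftitle": "XB"}
mapped_a = default_field_mapper(meta_a, "title\0xapian")
mapped_b = default_field_mapper(meta_b, "title\0xapian")
```

 Splitting on the first zero byte means the fieldname may hold any other
 character, while the value may itself contain zero bytes.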

 The functor could either be specified by passing it to the Enquire class
 (like a weighting object), or could be attached to the Query in some way
 (potentially allowing different mappings for different parts of the query
 tree).

 = Issues =

  * To make this work for remote searches, the functor would need to be
 serialisable, and be registered with the Registry object.  This would
 restrict the use of user-defined functors in place of the built-in one.
  * We'd probably want to add support to the TermGenerator for generating
 terms with the appropriate prefix, based on a fieldname and the prefix
 metadata values stored in a Xapian database (possibly the TermGenerator
 would be able to take the same functor as used by Enquire/Query).
  * The return values for get_matching_terms() would be confusing to users,
 since they'd hold the converted terms.  We'd probably need to have some
 way to convert a prefix-term back to the original field-term (or, at
 least, to a field-term which would map to the prefix-term: mapping to the
 original term wouldn't be possible in general since two fieldnames could
 map to the same prefix).
  * This solution seems overly complex (both in
 implementation and API).  However, it's the best I've come up with so far
 which doesn't involve massive re-working of the Xapian API, or relying on
 as-yet unimplemented and unproven database storage improvements.
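
 The reverse conversion mentioned for get_matching_terms() could look
 roughly like this.  It is only a sketch: unmap_term is a hypothetical name,
 a dict stands in for the metadata, and the tie-breaking rules (longest
 prefix, then alphabetically first fieldname) are assumptions.

```python
def unmap_term(metadata, term):
    """Return a field term ('fieldname<NUL>value') that maps to the stored term.

    Uses the '_F<fieldname>' metadata keys.  If several fieldnames share the
    matching prefix, the alphabetically first one is returned, so the result
    maps to the stored term but is not necessarily the original field.
    """
    best_field, best_len = None, 0
    for key in sorted(metadata):
        prefix = metadata[key]
        # Prefer the longest matching prefix; on a tie the earlier key wins.
        if key.startswith("_F") and len(prefix) > best_len and term.startswith(prefix):
            best_field, best_len = key[2:], len(prefix)
    if best_field is None:
        return term  # no known prefix: return the stored term as-is
    return best_field + "\0" + term[best_len:]


metadata = {"_Ftitle": "XA", "_Fheading": "XA", "_Fauthor": "XB"}
ambiguous = unmap_term(metadata, "XAxapian")  # "title" and "heading" share "XA"
unique = unmap_term(metadata, "XBsmith")
```

 Here the stored term under "XA" converts back to the "heading" field even
 if the query used "title", which is the loss of information noted above.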

 A related issue is that it would be nice to support field-specific
 weighting schemes (e.g. BM25-F), which need Xapian to support a wider model
 of fields: in particular, implementing these would require Xapian to store
 "document length" values specific to fields.  Perhaps rather than
 following the approach advocated above, this means that it would be worth
 making the effort to add explicit support for fields to all the Xapian
 APIs which work with terms.

-- 
Ticket URL: <http://trac.xapian.org/ticket/442>
Xapian <http://xapian.org/>