[Xapian-tickets] [Xapian] #442: Add support for mapping field names to differing term prefixes across multiple databases
Xapian
nobody at xapian.org
Thu Feb 4 15:21:55 GMT 2010
#442: Add support for mapping field names to differing term prefixes across
multiple databases
-------------------------+--------------------------------------------------
 Reporter:  richard      |      Owner:  olly
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  1.2.x
Component:  Library API  |    Version:  SVN trunk
 Severity:  normal       |   Keywords:
Blockedby:               |   Platform:  All
 Blocking:               |
-------------------------+--------------------------------------------------
= Overview =
If multiple databases use differing schemes for mapping field names to
term prefixes, it's currently not possible to form a search across them.
I'd like to be able to perform a mapping from field name to the correct
prefix for a sub-database at the point at which the posting lists for a
query are opened, instead of when building the query, in order to allow me
to do such searches.
= Rationale =
When implementing higher level abstractions on top of Xapian (eg, Xappy),
it is very common to want to introduce the concept of data split into
fields. The current recommended way to do this is to use a different
prefix for each field.
One problem with this approach is that, if the abstraction is to hide the
implementation from users, it is necessary to allocate prefixes
automatically. There are two approaches: either use the field name as the
prefix, with some appropriate escaping mechanism, or generate and store a
mapping from fieldname to prefix (either the first time that a fieldname
is used, or in a preliminary "schema" generation step).
Using the field name as the prefix currently causes significant database
bloat, particularly for long field names. One benchmark: on an example
large database (containing 47 fields with descriptive field names
averaging 9.2 characters in length), moving from Xappy's allocated prefixes
to the full field names increases the postlist table size by 3.0%, the
position table size by 7.6% and the termlist size by 10.7% (with chert).
For Xappy, we instead generate a mapping from field name to a short
(usually 2 character) prefix, and store the mapping in metadata keys.
This seems to work well, but has a major drawback: it is no longer
possible to search across multiple databases unless they have an identical
mapping for all the fields involved in a search.
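To illustrate the drawback, here is a minimal sketch in plain Python. It makes no Xapian calls: the dicts stand in for the per-database mapping stored in metadata, and the `allocate_prefix` helper and its "XA", "XB", ... scheme are hypothetical, not Xappy's actual allocator. It shows how two databases that allocate prefixes on first use can end up with different mappings for the same fields.

```python
import string


def allocate_prefix(mapping, fieldname):
    """Return the prefix for fieldname, allocating the next free one if needed.

    `mapping` stands in for the fieldname -> prefix table that would be
    stored in a database's metadata (an assumption for this sketch).
    """
    if fieldname not in mapping:
        # Illustrative scheme only: two-character prefixes XA, XB, ...
        used = set(mapping.values())
        candidates = ("X" + c for c in string.ascii_uppercase)
        mapping[fieldname] = next(p for p in candidates if p not in used)
    return mapping[fieldname]


# Database A sees "title" first; database B sees "author" first.
db_a, db_b = {}, {}
for field in ("title", "author"):
    allocate_prefix(db_a, field)
for field in ("author", "title"):
    allocate_prefix(db_b, field)

print(db_a)  # {'title': 'XA', 'author': 'XB'}
print(db_b)  # {'author': 'XA', 'title': 'XB'}
```

A query term built from database A's mapping for the title field (say "XAxapian") would be looked up as an author term in database B, which is why a combined search is only safe when the mappings agree.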
One solution to this problem may be provided by future database backends;
it's possible that avoiding storing common prefixes in btree blocks will
avoid the problem sufficiently. However, at least one full fieldname
would still need to be stored in each btree block, and I'm not convinced
that this would remove the problem fully.
Another approach would be to rework the Xapian API so that in all places
where a term is supplied to the API, a (fieldname, term) pair is supplied
instead, and have Xapian perform all the mapping from fieldnames to
prefixes fully internally. It might be possible to do this in a
reasonably backwards compatible manner, by making the current versions of
the methods store the terms in a field with name "". The hard bit to do
in a backwards compatible manner would be working out what to return from
the APIs which currently return terms as single strings.
= Suggested solution =
I think it's undesirable to specify a fixed scheme for performing a
mapping from fieldnames to prefixes, since it won't be appropriate for all
users of Xapian (even the concept of "fields" isn't always appropriate).
Instead, I suggest adding the ability to register a functor to be called
when generating a leaf posting list from a leaf query. This functor would
be passed the database and the term from the query, and would return a
term. It would be able to use metadata stored in the sub-database to
convert the term appropriately for that sub-database.
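As a rough sketch of the idea (plain Python, not the Xapian API), the code below models the functor being called once per sub-database at the point the posting list is opened. `FakeDatabase`, `open_postlist`, `field_mapper` and the "field:value" term format are all hypothetical names and conventions chosen for illustration.

```python
class FakeDatabase:
    """Stand-in for a sub-database: metadata plus term -> document ids."""

    def __init__(self, metadata, postings):
        self.metadata = metadata
        self.postings = postings


def open_postlist(db, query_term, term_mapper):
    """Open the posting list for a leaf query term in one sub-database.

    The mapper is called here, when the posting list is opened, rather
    than when the query was built.
    """
    actual_term = term_mapper(db, query_term)
    return db.postings.get(actual_term, [])


def field_mapper(db, query_term):
    """Hypothetical functor: rewrite 'field:value' using db's own mapping."""
    field, sep, value = query_term.partition(":")
    prefix = db.metadata.get(field) if sep else None
    if prefix is None:
        return query_term  # no field part, or unknown field: leave unchanged
    return prefix + value


# Two sub-databases that store the "title" field under different prefixes.
db_a = FakeDatabase({"title": "XA"}, {"XAxapian": [1, 4]})
db_b = FakeDatabase({"title": "XB"}, {"XBxapian": [2]})

results = [(name, open_postlist(db, "title:xapian", field_mapper))
           for name, db in (("a", db_a), ("b", db_b))]
print(results)  # [('a', [1, 4]), ('b', [2])]
```

The same query term is resolved differently for each sub-database, so the query itself never needs to know which prefix a given database uses.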
A default functor could be defined which recognised a specific format for
terms (possibly a 0-byte separating the fieldname from the value for that
field, to allow any other character in fieldnames), and looked up the
prefix for that fieldname. The default functor could use a standard set of
metadata keys for the lookup: eg "_F<fieldname>". (This would introduce
the idea that an _ prefix is used for metadata "internal" to Xapian, with
"F" (for field) leaving room for us to store additional
internally-used-but-publicly-visible metadata in future if desired.)
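A default functor along these lines might look like the following sketch. It is plain Python with no Xapian calls: `default_term_mapper` and the `get_metadata` callable are hypothetical, and raising an error for an unmapped field is an assumption, since the behaviour in that case is left open above.

```python
def default_term_mapper(get_metadata, term):
    """Map 'fieldname<NUL>value' to '<prefix>value' via '_F<fieldname>' keys.

    `get_metadata` is any callable from metadata key to stored value,
    returning an empty string when the key is unset.
    """
    fieldname, sep, value = term.partition("\0")
    if not sep:
        return term  # not in the field format; pass through untouched
    prefix = get_metadata("_F" + fieldname)
    if not prefix:
        # Assumption: treat a missing mapping as an error.
        raise KeyError("no prefix stored for field %r" % fieldname)
    return prefix + value


# Metadata of two sub-databases with differing prefixes for "title".
metadata_a = {"_Ftitle": "XA", "_Fauthor": "XB"}
metadata_b = {"_Ftitle": "XT"}


def lookup_a(key):
    return metadata_a.get(key, "")


def lookup_b(key):
    return metadata_b.get(key, "")


print(default_term_mapper(lookup_a, "title\0xapian"))  # XAxapian
print(default_term_mapper(lookup_b, "title\0xapian"))  # XTxapian
print(default_term_mapper(lookup_a, "plain"))          # plain
```

Terms without a 0-byte pass through unchanged, so existing queries built from already-prefixed terms would keep working under this scheme.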
The functor could either be specified by passing it to the Enquire class
(like a weighting object), or could be attached to the Query in some way
(potentially allowing different mappings for different parts of the query
tree).
= Issues =
* To make this work for remote searches, the functor would need to be
serialisable, and be registered with the Registry object. This would
restrict the use of user-defined functors in place of the built-in one.
* We'd probably want to add support to the TermGenerator for generating
terms with the appropriate prefix, based on a fieldname and the prefix
metadata values stored in a Xapian database (possibly the TermGenerator
would be able to take the same functor as used by Enquire/Query).
* The return values for get_matching_terms() would be confusing to users,
since they'd hold the converted terms. We'd probably need to have some
way to convert a prefix-term back to the original field-term (or, at
least, to a field-term which would map to the prefix-term: mapping to the
original term wouldn't be possible in general since two fieldnames could
map to the same prefix).
* This solution seems rather complex (both in implementation and API).
However, it's the best I've come up with so far which doesn't involve
massive re-working of the Xapian API, or relying on as-yet unimplemented
and unproven database storage improvements.
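The get_matching_terms() point can be made concrete with a small sketch of the reverse mapping. This is plain Python with a hypothetical `reverse_map` helper and an invented mapping; it assumes no prefix is itself the start of another prefix.

```python
def reverse_map(field_to_prefix, prefixed_term):
    """Return every (fieldname, value) pair that maps to prefixed_term."""
    matches = []
    for fieldname, prefix in sorted(field_to_prefix.items()):
        if prefixed_term.startswith(prefix):
            matches.append((fieldname, prefixed_term[len(prefix):]))
    return matches


# "title" and "headline" deliberately share a prefix in this example.
mapping = {"title": "XA", "headline": "XA", "author": "XB"}

print(reverse_map(mapping, "XAxapian"))
# [('headline', 'xapian'), ('title', 'xapian')]
print(reverse_map(mapping, "XBolly"))
# [('author', 'olly')]
```

When two fieldnames share a prefix, the converted term can only be mapped back to a set of candidate field-terms, not to the one the user originally supplied.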
A related issue is that it would be nice to support field-specific
weighting schemes (eg, BM25-F), which need Xapian to support a wider model
of fields: in particular, implementing these would require Xapian to store
"document length" values specific to fields. Perhaps, rather than
following the approach advocated above, this means that it would be worth
making the effort to add explicit support for fields to all the Xapian
APIs which work with terms.
--
Ticket URL: <http://trac.xapian.org/ticket/442>
Xapian <http://xapian.org/>