[Xapian-discuss] word context, numeric values, and characters
Olly Betts
olly at survex.com
Wed Dec 7 04:37:30 GMT 2005
On Tue, Dec 06, 2005 at 09:22:13PM -0600, Peter Karman wrote:
> Has any thought been given to storing the context separately from the word
> string itself? E.g.,
>
> add_term( "foo", wdfinc, "T" )
Richard wondered about doing it several years ago, but I don't think he
actually worked on it.
> Would that slow down queries, by having to do a separate compare against a
> different value stored elsewhere? I'm assuming the reason the current
> convention looks the way it does for speed in searching.
My first thought on how to store it internally would be to combine the
prefix and term to produce the key for the postlist B-tree (where
currently we just use the term). So that shouldn't be slower, other
than it would probably make the keys a byte longer than at present
(perhaps only for prefixed terms - I've not thought it through in
detail...)
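As a rough illustration of that idea (not Xapian code), one possible key encoding prepends a single length byte for the prefix, so the key is one byte longer than the prefix and term together. The function name make_postlist_key and the length-byte layout are assumptions made for this sketch only; the real B-tree key format is not described here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT part of Xapian: build a postlist key from a
// prefix and a term that are stored separately.  Layout assumed here:
//   [1 byte: prefix length][prefix bytes][term bytes]
std::string make_postlist_key(const std::string& prefix,
                              const std::string& term) {
    if (prefix.size() > 255) {
        throw std::length_error("prefix too long for a one-byte length");
    }
    std::string key;
    key.reserve(1 + prefix.size() + term.size());
    key += static_cast<char>(prefix.size());  // the extra byte
    key += prefix;
    key += term;
    return key;
}
```

With this layout, the prefix "XFOO" with term "colon" and the prefix "XFOOcolon" with an empty term yield different keys, because the length byte differs.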
The "capital letter prefix" scheme is actually inherited from the
proprietary Muscat 3.6 system which Xapian was originally being written
to replace. We kept it because it's simple and does the job. But it
would be cleaner to keep the term and prefix separate. That would
require some API redesign as we'd need to decide what to do where we
currently return a term as a std::string. I'm not against it
necessarily, but it seems a lot of work to replace something which works
pretty well in practice, and I can see more urgent issues.
> I ask because it seems like there exists the possibility of missing matches
> (or false positives) if you wanted to include the ':' as a valid word
> character, as in indexing source code, for example. If I wanted to find
> "foo::bar" exactly, and not the phrase "foo bar", and I happened to have a
> prefix called "foo", then might things get sticky?
Well, there are two issues here - the term generation and the query
parsing.
If you use the standard convention of forcing terms to lower-case and
using upper-case prefixes, there's clearly no ambiguity in the term
generation.
If you try to create an ambiguous situation then Omega, scriptindex, and
the QueryParser will insert a colon between the prefix and term in an
attempt to save you from yourself (and this should generally work,
though with a bit of thought I'm sure you could defeat it with a
suitably contrived situation). Note that this colon is in the term as
it exists in the database.
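A minimal sketch of the colon-insertion idea follows. The exact test used by Omega, scriptindex, and the QueryParser is not spelled out above, so the rule here (insert ':' when the term starts with an upper-case ASCII letter or with a colon) is an assumption, and make_prefixed_term is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper, NOT the actual Xapian/Omega code.
// Assumed rule: if the term begins with an upper-case ASCII letter or
// a ':', the boundary between prefix and term would be ambiguous, so a
// ':' is stored between them.
std::string make_prefixed_term(const std::string& prefix,
                               const std::string& term) {
    const bool ambiguous =
        !term.empty() &&
        ((term[0] >= 'A' && term[0] <= 'Z') || term[0] == ':');
    return ambiguous ? prefix + ":" + term : prefix + term;
}
```

Under this assumed rule, lower-cased terms are stored as plain prefix plus term, and only terms that could be mistaken for part of the prefix pick up the extra colon.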
But your example is of a colon in the user's query. That's a different
issue.
The QueryParser class will never identify "foo::bar" as having a prefix
since it checks after the ':' for an alphanumeric character before
accepting it as a prefix (or '(' or '"' to support prefixed
sub-expressions and phrases).
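The check described above can be sketched as follows. This is an illustration of the rule as stated, not the QueryParser source; colon_starts_prefixed_term is a hypothetical name, and colon_pos is assumed to be the position of the colon that ends the candidate prefix name.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, NOT the QueryParser implementation.
// Returns true if the character following the colon at colon_pos is
// alphanumeric, '(' or '"', i.e. the colon may introduce a prefixed
// term, sub-expression, or phrase.
bool colon_starts_prefixed_term(const std::string& query,
                                std::size_t colon_pos) {
    if (colon_pos >= query.size() || query[colon_pos] != ':') return false;
    if (colon_pos + 1 >= query.size()) return false;
    const unsigned char next =
        static_cast<unsigned char>(query[colon_pos + 1]);
    return std::isalnum(next) != 0 || next == '(' || next == '"';
}
```

For "foo::bar" the character after the first colon is another colon, so "foo" is not accepted as a prefix.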
And the QueryParser doesn't allow you to specify that a colon is a valid
word character either (at least not currently).
*If* colon was a valid word character, then I think we'd just have to
decide how to handle a query like "mailto:user@example.com" if "mailto"
was a specified prefix, and then document how it will be handled. I
tend to think that we should honour the explicit prefix, but if there's
a real world situation in which that's problematic we could perhaps see
if it exists as a prefixed term and if not see if it exists as a term
with an embedded ':' or something like that.
Note that "mailto:user@example.com" will generally map to a term like
XMAILTOuser@example.com, so foo:colon:term would produce XFOOcolon:term
which is only a problem if you have an XFOOcolon prefix too...
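To make the mapping concrete, here is a small sketch that honours the explicit prefix by splitting at the first colon only. The function name query_word_to_term and the use of a std::map from field name to term prefix are assumptions for illustration; they are not the QueryParser API.

```cpp
#include <map>
#include <string>

// Hypothetical helper, NOT the QueryParser API.
// If the text before the first ':' is a known field name, replace it
// with the corresponding term prefix and keep the rest (including any
// further colons) as the term.  Otherwise return the word unchanged.
std::string query_word_to_term(
        const std::map<std::string, std::string>& prefixes,
        const std::string& word) {
    const std::size_t colon = word.find(':');
    if (colon != std::string::npos) {
        const auto it = prefixes.find(word.substr(0, colon));
        if (it != prefixes.end()) {
            return it->second + word.substr(colon + 1);
        }
    }
    return word;
}
```

With "mailto" mapped to "XMAILTO" and "foo" mapped to "XFOO", the two examples above come out as XMAILTOuser@example.com and XFOOcolon:term respectively.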
So to summarise: yes - here be dragons! But you won't go near them if
you stick to using all-caps prefixes.
Cheers,
Olly