[Xapian-discuss] Underscores and colons
Olly Betts
olly at survex.com
Fri Oct 28 09:50:53 BST 2005
On Thu, Oct 27, 2005 at 04:50:12PM -0700, John Wang wrote:
> Is there a way to make underscores and colons in terms behave like letters?
> It would be nice for query terms like doc_id and Search::Xapian to be
> treated as one term, not two. The results would be a lot more relevant for
> some queries.
Terms can contain any characters (even zero bytes). You don't say how
you're generating them, but I guess you must be using Omega...
Omega's current strategy is to split terms on characters like underscore
and colon, and to let _ and : in a query generate a phrase search. So
the query Search::Xapian is the same as the query "Search Xapian".
One benefit of this is that a query for Xapian matches Search::Xapian
in a document, which is usually desirable. That's probably less of
a benefit for underscore, but it is how Omega currently handles it.
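To make the behaviour above concrete, here is a rough sketch in plain Python. It is not Omega's actual code: the helper names (`index_terms`, `parse_query`, `matches_phrase`) and the simple "letters and digits only" rule are assumptions made for illustration. It shows why a query for Xapian matches Search::Xapian in a document, and why Search::Xapian in a query ends up needing positional information.

```python
import re

# Assumption for this sketch: a term is a run of ASCII letters/digits,
# so underscore and colon act as separators.
TERM_RE = re.compile(r"[A-Za-z0-9]+")

def index_terms(text):
    """Return (term, position) pairs for a document, splitting on _ and : etc."""
    return [(m.group().lower(), pos)
            for pos, m in enumerate(TERM_RE.finditer(text), start=1)]

def parse_query(query):
    """Turn each whitespace-separated word into a list of terms.

    A list with more than one term stands for a phrase search.
    """
    parsed = []
    for word in query.split():
        parts = [p.lower() for p in TERM_RE.findall(word)]
        if parts:
            parsed.append(parts)
    return parsed

def matches_phrase(postings, phrase):
    """True if the phrase's terms occur at consecutive positions."""
    positions = {}
    for term, pos in postings:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )

doc = index_terms("Using Search::Xapian from Perl")
print(parse_query("Search::Xapian"))                 # one phrase of two terms
print(matches_phrase(doc, ["xapian"]))               # plain term query
print(matches_phrase(doc, ["search", "xapian"]))     # phrase query
```

The phrase check is the part that has to consult term positions, which is where the cost comes from on a large database.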
The QueryParser class also assumes you're doing this, because it was
originally part of Omega. That needs fixing - the tokenisation should
be configurable there.
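As an illustration only of what "configurable" could mean here, the sketch below builds a tokeniser from a caller-supplied set of extra term characters. `make_tokeniser` and its `extra_word_chars` parameter are hypothetical names, not part of the QueryParser API.

```python
import re

def make_tokeniser(extra_word_chars=""):
    """Build a tokeniser that treats letters, digits and any characters in
    extra_word_chars as part of a term (hypothetical helper)."""
    char_class = "A-Za-z0-9" + re.escape(extra_word_chars)
    pattern = re.compile("[" + char_class + "]+")

    def tokenise(text):
        return [t.lower() for t in pattern.findall(text)]

    return tokenise

split_all = make_tokeniser()        # underscore and colon separate terms
keep_both = make_tokeniser("_:")    # underscore and colon behave like letters

print(split_all("doc_id and Search::Xapian"))
print(keep_both("doc_id and Search::Xapian"))
```

With the second tokeniser doc_id and Search::Xapian each become a single term, so no phrase search is needed, but a query for Xapian alone would no longer match Search::Xapian in a document. The same rule would have to be applied at index time and at query time.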
This use of phrase searches does cause slow searches on large databases
sometimes:
http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=22
It's also annoying if you don't want to support actual phrase searches
but do want underscored terms, etc. to work.
I'm working on addressing this issue, currently by working on flint,
which will make access to positional information faster, but I also
intend to revisit the tokenisation rules.
Cheers,
Olly