[Xapian-discuss] What are the separators that scriptindex uses?
Olly Betts
olly at survex.com
Fri Sep 17 10:48:05 BST 2004
On Wed, Sep 15, 2004 at 03:48:54PM -0400, Jim Lynch wrote:
> I've been asked to find out what are considered separators for
> scriptindex?
Essentially, non-alphanumerics. But there's special handling for things
like "N.A.T.O.", "C++", and "AT&T".
> The reason for the question is that my data contains some
> strange stuff, like output from core dumps, source code for various
> programming languages like assembly, part numbers (not just numbers, of
> course) and other weird collections of funny characters. Fortunately no
> unicode just yet. I'm trying to get a feel for how difficult it's going
> to be to search for this stuff and what the rules might be.
The following characters are treated as "phrase makers" by the
QueryParser: _/\@'*.- so for example an email address is indexed as
separate words, and a search for it triggers a phrase search.
> Also can I assume omega uses the same set of separators?
Pretty much. The indexer and QueryParser are designed to work together.
> For instance if I look for something like PARAM_DEV-445*Foggy, will it
> be found? Will it be multiple terms?
It'll be indexed as 4 terms, and searched for as a phrase of those 4
terms.
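
To make the splitting concrete, here is a rough Python sketch of the
behaviour described above. It is not Xapian's code: the function name
to_query and the regular expressions are made up for illustration, it
assumes plain ASCII input, and it leaves out the special handling for
things like "N.A.T.O.", "C++", and "AT&T".

```python
import re

# A "word" here is a run of ASCII alphanumerics (an assumption made for
# this sketch; the real indexer has extra special cases).
WORD = r"[A-Za-z0-9]+"

# The phrase-maker characters listed in the post: _ / \ @ ' * . -
JOIN = r"[_/\\@'*.\-]"

# One group = words glued together by single phrase-maker characters.
GROUP = re.compile(rf"{WORD}(?:{JOIN}{WORD})*")


def to_query(text):
    """Return a list of (kind, terms) pairs for text.

    kind is "phrase" when phrase makers joined several words into one
    group, and "term" when the group is a single word. Hypothetical
    helper, not part of any Xapian API.
    """
    result = []
    for match in GROUP.finditer(text):
        terms = re.findall(WORD, match.group(0))
        kind = "phrase" if len(terms) > 1 else "term"
        result.append((kind, terms))
    return result


if __name__ == "__main__":
    for sample in ["PARAM_DEV-445*Foggy", "olly@survex.com", "hello world"]:
        print(sample, "->", to_query(sample))
```

In this sketch PARAM_DEV-445*Foggy comes out as one phrase of the 4
terms PARAM, DEV, 445, and Foggy, while "hello world" comes out as two
independent terms because a space is not a phrase maker.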
> BTW, how are phrase searches these days?
Why do you ask? Did you have a problem with them before?
As far as I know they work correctly. They're inherently more expensive
than non-phrase searches, and there are a couple of Bugzilla entries
for related enhancements (one to improve term AND "some phrase"; the
other to reduce the number of cases where a phrase query is required
- e.g. "e-mail" uses a phrase at present).
Cheers,
Olly