[Xapian-discuss] What are the separators that scriptindex uses?

Fri Sep 17 10:48:05 BST 2004

On Wed, Sep 15, 2004 at 03:48:54PM -0400, Jim Lynch wrote:
> I've been asked to find out what are considered separators for 
> scriptindex?

Essentially, any non-alphanumeric character is a separator.  But there's
special handling for things like "N.A.T.O.", "C++", and "AT&T".

> The reason for the question is that my data contains some 
> strange stuff, like output from core dumps, source code for various 
> programming languages like assembly, part numbers (not just numbers, of 
> course) and other weird collections of funny characters.  Fortunately no 
> unicode just yet.  I'm trying to get a feel for how difficult it's going 
> to be to search for this stuff and what the rules might be. 

The following characters are treated as "phrase makers" by the
QueryParser: _ / \ @ ' * . -
So, for example, an email address is indexed as separate words, and a
search for it triggers a phrase search.
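
To make the phrase-maker rule concrete, here is a rough Python sketch.  It
is not Xapian's QueryParser code: the function name parse_query_word is
made up for illustration, it assumes ASCII-only alphanumerics, and it
ignores the special cases like "C++" and "AT&T" as well as stemming and
case folding.

```python
import re

# The phrase-maker characters listed above.
PHRASE_MAKERS = set("_/\\@'*.-")

def parse_query_word(word):
    """Simplified model (an assumption, not Xapian's real logic):
    split a query word on non-alphanumerics and report whether the
    pieces would be searched for as a phrase."""
    # Drop leading and trailing non-alphanumerics so only inner
    # separators are considered.
    core = re.sub(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "", word)
    # The capturing group keeps the separators in the result, so terms
    # sit at even indexes and separators at odd indexes.
    pieces = re.split(r"([^A-Za-z0-9]+)", core)
    terms = [p for p in pieces[0::2] if p]
    separators = pieces[1::2]
    # A phrase needs at least two terms joined only by phrase makers.
    is_phrase = len(terms) > 1 and all(
        set(sep) <= PHRASE_MAKERS for sep in separators
    )
    return terms, is_phrase

print(parse_query_word("user@example.com"))
print(parse_query_word("hello"))
```

In this model an address such as user@example.com comes out as three
terms flagged as a phrase, while a plain word is a single non-phrase term.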

> Also can I assume omega uses the same set of separators? 

Pretty much.  The indexer and QueryParser are designed to work together.

> For instance if I look for something like PARAM_DEV-445*Foggy, will it 
> be found?  Will it be multiple terms? 

It'll be indexed as 4 terms, and searched for as a phrase of those 4
terms.
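
As a rough illustration of what "indexed as 4 terms, searched as a phrase"
means, here is a small Python sketch.  The helpers index_terms and
phrase_matches are hypothetical, not part of Xapian or scriptindex; the
sketch assumes ASCII-only alphanumerics and leaves out case folding,
stemming, and the special cases mentioned earlier.

```python
import re

def index_terms(text):
    """Split text on non-alphanumerics and return (position, term)
    pairs, with positions starting at 1 (a simplifying assumption)."""
    return list(enumerate(re.findall(r"[A-Za-z0-9]+", text), start=1))

def phrase_matches(indexed, phrase_terms):
    """Return True if phrase_terms occur at consecutive positions."""
    terms = [term for _, term in indexed]
    n = len(phrase_terms)
    return n > 0 and any(
        terms[i:i + n] == phrase_terms
        for i in range(len(terms) - n + 1)
    )

# Made-up document text containing the part number from the question.
doc = index_terms("part PARAM_DEV-445*Foggy failed")
query = [term for _, term in index_terms("PARAM_DEV-445*Foggy")]

print(query)                       # the 4 terms the query splits into
print(phrase_matches(doc, query))  # whether they appear consecutively
```

The phrase check is why positional information matters: the same four
terms scattered through a document would not match.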

> BTW, how are phrase searches these days? 

Why do you ask?  Did you have a problem with them before?

As far as I know they work correctly.  They're inherently more expensive
than non-phrase searches, and there are a couple of Bugzilla entries
for related enhancements: one to improve term AND "some phrase", the
other to reduce the number of cases where a phrase query is required
(e.g. "e-mail" uses a phrase at present).

Cheers,
    Olly