[Xapian-discuss] Re: searching and sorting by date

Fri Mar 24 09:46:43 GMT 2006

On Thu, Mar 23, 2006 at 10:07:27AM -0800, Michel Pelletier wrote:

> So let me make sure I understand this as this prefix technique as I'm 
> just now starting to grasp its significance.  Terms that come from text 
> should be indexed as lower case and with the special prefix 'R' (is 
> there something special about 'R', or did you just pick one at random?) 
> is this so you can differentiate them from other terms that are to be 
> used as boolean search terms are prefixed with one capital letter?

The idea is that any term that starts with a capital letter is
"special". The capital letters make up a "prefix", so you might have a
different prefix for each header if you were indexing email.

When generating terms, then, you either generate them unprefixed (so
all lowercase) or prefixed (so all lowercase, preceded by the
prefix). If the word you're generating the term from is lowercase, all
is well; if not, you generate two terms: the lowercased term and a "raw"
term (the same, preceded by 'R'). (If you have a prefix *and* you're
generating a raw term the prefix comes first, and if the prefix is
more than one character you put a ':' between the two - see the top of
indextext.cc:index_text() in the omega source.)

Raw terms are actually more to do with stemming - we don't stem when
generating raw terms, but we do for all others. (Sorry, I may not have
made that clear before.) I imagine this is because capitalised words
in English are likely to be names, where stemming isn't that helpful
(Richard or Olly should be able to confirm this).

----------------------------------------------------------------------
Subject: My house

Some words.
----------------------------------------------------------------------

With 'S' being the prefix for Subject, the email above might generate
the following terms:

 Smy
 SRmy
 Shouse
 some
 Rsome
 words

> That's the part I don't get, so FOO45 is the same as Foo45?  Doesn't 
> this mean that there can only be 26 kinds of specially prefixed terms in 
> any xapian application?  Does xapian treat initially capped terms in 
> some special way at index or query time?  Or is it just a trick to get 
> the query parser to recognize "subcategory:41"?

It's easy to get confused :-)

 * a WORD is something we get out of the text, by splitting it into
   word-like chunks
 * a TERM is what we generate out of a WORD, and stuff into a Xapian
   database
 * xapian doesn't care what TERMS look like
 * omega (and other applications that use omega's TERM convention)
   uses the above prefixing system when creating TERMS
 * QueryParser uses the same convention, so applications using it
   should generate their TERMS in the same way

So the WORD "FOO45" and the WORD "Foo45" will both generate two TERMS
"foo45" and "Rfoo45".

The QueryParser has a mechanism for turning "prefix:word" into
an appropriate TERM using this convention, by mapping the "prefix" bit
to the appropriate capital-letter prefix in the TERM.
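
To give a feel for that mapping, here is a toy sketch in Python. It is
not how QueryParser is implemented and doesn't use the Xapian API at
all; PREFIX_MAP and query_term are names invented for this example, and
the only mapping assumed is "subject" to 'S' as in the email example
above:

```python
# Hypothetical table from the user-visible prefix to the TERM prefix.
PREFIX_MAP = {"subject": "S"}


def query_term(token, prefix_map=PREFIX_MAP):
    """Hypothetical helper: turn one query token into a TERM.

    "prefix:word" becomes the mapped capital-letter prefix followed by
    the lowercased word; anything else is simply lowercased.
    """
    name, sep, word = token.partition(":")
    if sep and name in prefix_map:
        return prefix_map[name] + word.lower()
    return token.lower()


prefixed = query_term("subject:House")   # looks up 'S' for "subject"
plain = query_term("House")              # no prefix, just lowercased
```

The point is only that the user types the friendly name and the
application looks up the capital-letter prefix, so the query ends up
using the same TERMS that were generated at index time.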

Is that clearer? You could have a look at omega/docs/termprefixes.txt
and see if that helps.

> I guess in general I'm having a hard time grasping the prefix idea, I 
> understand its purpose, but without a complete example or a thorough 
> description on the web page it's difficult to see how it works.  Does 
> anyone know of a good example, perferrable in Python but I can work in 
> just about anything, that shows documents with both text and boolean 
> terms being indexed and queried?

omega/indextext.cc contains the term generation code that Omega
uses. Alternatively I have a prototype email indexer written in python
- let me know if you want that (it's under GPL). However it's
basically just the indextext term generator in python, with a whole
load of stuff to handle the email format, so it's not that clear.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org