[Xapian-discuss] Re: searching and sorting by date

Thu Mar 23 18:07:27 GMT 2006

James Aylett wrote:
> On Wed, Mar 22, 2006 at 12:19:26PM -0800, Michel Pelletier wrote:
> 
> 
>>I did exclude it for simplicity.  I have one question about the 
>>technique described in the above link: if I have subcategory=41, that 
>>will be encoded as the term SUBCATEGORY41.  All that makes sense, and I 
>>have been able to confirm that it works.
> 
> 
> I'd probably recommend calling it XSUBCATEGORY41 instead. (X as a
> first character is reserved, and so no other "standard" Xapian app
> will ever generate terms starting with it - it's intended for
> domain-specific use.)
> 
> S is used as the subject prefix - however it will never contain any
> further capital letters after it in a subject term. So it's not
> actually a problem to generate SUBCATEGORY41 if you want to (providing
> you never generate Subcategory41, which will be identical to finding
> "ubcategory41" in the subject of a document - unlikely though that may
> be!).
> 
> 
>>But what if someone authoring the document text used the term 
>>SUBCATEGORY41, either inadvertently or on purpose?  Would that skew 
>>their document into more search results than would normally be the case? 
>>Is there a way to prevent that? Do I have to keep my 
>>application-specific terms and searchable text in two different databases?
> 
> 
> Generally you will take upper-cased words and do the following:
> 
>  * (stem and) index the lower-cased version
>  * index the lower-cased version prefixed by "R"
> 
> So if you had "SUBCATEGORY41" in the text, you'd generate two terms:
> "subcategory41" and "Rsubcategory41". The former will happily match
> "subcategory41" in a query, and the latter will match "SUBCATEGORY41"
> in a query (as well as "SubCategory41" etc. etc.).
> 
> (You probably already know this, but you'll want to call
> add_prefix("subcategory", "SUBCATEGORY") or something on your
> QueryParser object...)
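
The term generation James describes can be sketched in plain Python,
without touching the Xapian bindings. This is only an illustration: the
helper names text_terms and boolean_term are hypothetical, stemming is
left out, and "upper-cased word" is assumed here to mean a word whose
first character is a capital letter.

```python
def text_terms(word):
    """Return the terms generated for one word of document text."""
    lowered = word.lower()
    terms = [lowered]  # always index the lower-cased form (stemming omitted)
    # Assumption: a word counts as "upper-cased" if it starts with a capital.
    if word[:1].isupper():
        terms.append("R" + lowered)  # lower-cased form with the "R" prefix
    return terms


def boolean_term(prefix, value):
    """Return an application-specific filter term, e.g. XSUBCATEGORY41."""
    return prefix + str(value)


from_text = text_terms("SUBCATEGORY41")
from_field = boolean_term("XSUBCATEGORY", 41)
print(from_text)
print(from_field)
```

Because text always ends up lower-cased (optionally behind "R"), a word
typed into the document body never produces the same term as the filter
term built from the subcategory field.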

So let me make sure I understand this prefix technique, as I'm just now 
starting to grasp its significance.  Terms that come from text should be 
indexed as lower case and with the special prefix 'R' (is there 
something special about 'R', or did you just pick one at random?).  Is 
this so you can differentiate them from other terms that are to be used 
as boolean search terms, which are prefixed with one capital letter?

That's the part I don't get: is FOO45 the same as Foo45?  Doesn't this 
mean that there can only be 26 kinds of specially prefixed terms in any 
Xapian application?  Does Xapian treat initially capped terms in some 
special way at index or query time?  Or is it just a trick to get the 
query parser to recognize "subcategory:41"?

I guess in general I'm having a hard time grasping the prefix idea.  I 
understand its purpose, but without a complete example or a thorough 
description on the web page it's difficult to see how it works.  Does 
anyone know of a good example, preferably in Python (but I can work in 
just about anything), that shows documents with both text and boolean 
terms being indexed and queried?
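
To make the question concrete, here is a rough sketch of the idea in
plain Python rather than against the Xapian bindings. Everything in it
is hypothetical (the ToyIndex class, the PREFIXES table, the field:value
parsing); it only imitates what a prefix mapping such as add_prefix() is
meant to achieve and says nothing about how Xapian implements it.

```python
from collections import defaultdict

# Hypothetical field-name -> term-prefix table, standing in for the
# mapping a query parser would be given.
PREFIXES = {"subcategory": "XSUBCATEGORY"}


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text, **fields):
        # Text terms: lower-cased, plus an "R"-prefixed copy for capitalised words.
        for word in text.split():
            lowered = word.lower()
            self.postings[lowered].add(doc_id)
            if word[:1].isupper():
                self.postings["R" + lowered].add(doc_id)
        # Boolean terms: upper-case prefix followed by the field value.
        for name, value in fields.items():
            self.postings[PREFIXES[name] + str(value)].add(doc_id)

    def search(self, query):
        """AND together every plain word and field:value filter in the query."""
        result = None
        for token in query.split():
            name, sep, value = token.partition(":")
            if sep and name in PREFIXES:
                term = PREFIXES[name] + value  # "subcategory:41" -> XSUBCATEGORY41
            else:
                term = token.lower()
            docs = self.postings.get(term, set())
            result = docs if result is None else result & docs
        return sorted(result or [])


index = ToyIndex()
index.add(1, "red widgets for sale", subcategory=41)
index.add(2, "someone typed SUBCATEGORY41 in the body", subcategory=7)
print(index.search("subcategory:41"))
print(index.search("subcategory41"))
```

In this toy, the filter query finds only document 1 and the plain-word
query finds only document 2, since the word in the body is stored as
"subcategory41" and "Rsubcategory41" and never as XSUBCATEGORY41.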

-Michel