[Xapian-discuss] Re: searching and sorting by date
Michel Pelletier
michel at dialnetwork.com
Thu Mar 23 18:07:27 GMT 2006
James Aylett wrote:
> On Wed, Mar 22, 2006 at 12:19:26PM -0800, Michel Pelletier wrote:
>
>
>>I did exclude it for simplicity. One question I have about the
>>technique described in the above link: if I have subcategory=41, that
>>will be encoded as the term SUBCATEGORY41. All that makes sense, and I
>>have been able to confirm that works.
>
>
> I'd probably recommend calling it XSUBCATEGORY41 instead. (X as a
> first character is reserved, and so no other "standard" Xapian app
> will ever generate terms starting with that - it's intended for
> domain-specific use.)
>
> S is used as the subject prefix - however it will never contain any
> further capital letters after it in a subject term. So it's not
> actually a problem to generate SUBCATEGORY41 if you want to (providing
> you never generate Subcategory41, which will be identical to finding
> "ubcategory41" in the subject of a document - unlikely though that may
> be!).
>
>
>>But what if someone who was authoring the document text used the term
>>SUBCATEGORY41, either inadvertently or on purpose. Would that skew
>>their document into more search results than would normally be the case?
>>Is there a way to prevent that? Do I have to keep my application
>>specific terms and searchable text in two different databases?
>
>
> Generally you will take upper-cased words and do the following:
>
> * (stem and) index the lower-cased version
> * index the lower-cased version prefixed by "R"
>
> So if you had "SUBCATEGORY41" in the text, you'd generate two terms:
> "subcategory41" and "Rsubcategory41". The former will happily match
> "subcategory41" in a query, and the latter will match "SUBCATEGORY41"
> in a query (as well as "SubCategory41" etc. etc.).
>
> (You probably already know this, but you'll want to call
> add_prefix("subcategory", "SUBCATEGORY") or something on your
> QueryParser object...)
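
As a rough illustration of what such a prefix mapping does, here is a toy
stand-in in plain Python. It is not Xapian's QueryParser: FIELD_PREFIXES and
query_to_terms are hypothetical names, and the XSUBCATEGORY prefix follows the
suggestion quoted above. It only shows the idea that a user-facing field name
is rewritten into a term prefix, while bare words are lowercased.

```python
# Toy stand-in for a query parser's field-to-prefix mapping (not the Xapian API).
# Maps the user-facing field name to the term prefix assumed at index time.
FIELD_PREFIXES = {"subcategory": "XSUBCATEGORY"}  # hypothetical mapping


def query_to_terms(query):
    """Turn a query like 'subcategory:41 Sorting' into a list of index terms."""
    terms = []
    for word in query.split():
        field, sep, value = word.partition(":")
        if sep and field in FIELD_PREFIXES:
            # 'subcategory:41' becomes the prefixed term 'XSUBCATEGORY41'.
            terms.append(FIELD_PREFIXES[field] + value)
        else:
            # Simplification: plain words are just lowercased (no stemming,
            # no 'R' terms), so they can never look like a prefixed term.
            terms.append(word.lower())
    return terms


print(query_to_terms("subcategory:41 Sorting"))
```

Note that typing SUBCATEGORY41 as a plain query word yields the lowercase term
subcategory41 in this sketch, not the prefixed boolean term.
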
So let me make sure I understand this prefix technique, as I'm just now
starting to grasp its significance. Terms that come from text should be
indexed in lower case, and also with the special prefix 'R' (is there
something special about 'R', or did you just pick one at random?). Is
this so you can differentiate them from other terms that are to be used
as boolean search terms, which are prefixed with one capital letter?
That's the part I don't get. So FOO45 is the same as Foo45? Doesn't
this mean that there can only be 26 kinds of specially prefixed terms in
any Xapian application? Does Xapian treat initially capitalised terms in
some special way at index or query time? Or is it just a trick to get
the query parser to recognize "subcategory:41"?
I guess in general I'm having a hard time grasping the prefix idea. I
understand its purpose, but without a complete example or a thorough
description on the web page it's difficult to see how it works. Does
anyone know of a good example, preferably in Python (but I can work in
just about anything), that shows documents with both text and boolean
terms being indexed and queried?
-Michel
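
For readers with the same question, the scheme discussed in this thread can be
sketched in plain Python. This is a toy in-memory index, not Xapian or its
Python bindings: text_terms, index_document and search are hypothetical names,
stemming is omitted, and treating any word containing a capital letter as
"upper-cased" is an assumption. It shows why document text containing
"SUBCATEGORY41" cannot collide with the boolean term XSUBCATEGORY41: terms
generated from text are always lowercase, optionally with an 'R' in front.

```python
import re
from collections import defaultdict

# term -> set of document ids (toy inverted index, not a Xapian database)
index = defaultdict(set)


def text_terms(text):
    """Terms for free text: the lowercase form of every word, plus
    'R' + lowercase form for words that contained a capital letter."""
    terms = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        lower = word.lower()
        terms.add(lower)
        if word != lower:
            terms.add("R" + lower)
    return terms


def index_document(doc_id, text, subcategory):
    """Index the text terms and one boolean term for the subcategory."""
    for term in text_terms(text):
        index[term].add(doc_id)
    # Boolean term with the 'X' prefix suggested earlier in the thread.
    index["XSUBCATEGORY%d" % subcategory].add(doc_id)


def search(text_word, subcategory=None):
    """Match one text word, optionally filtered by the boolean term."""
    hits = set(index.get(text_word.lower(), set()))
    if subcategory is not None:
        hits &= index.get("XSUBCATEGORY%d" % subcategory, set())
    return hits


index_document(1, "Sorting results by date", subcategory=41)
index_document(2, "This text mentions SUBCATEGORY41 by name", subcategory=7)

print(sorted(search("by", subcategory=41)))   # only document 1 is in subcategory 41
print(sorted(index["XSUBCATEGORY41"]))        # document 2's text did not add this term
print(sorted(index["Rsubcategory41"]))        # document 2's text produced this instead
```

Document 2 mentions SUBCATEGORY41 in its text, yet it only gains the terms
subcategory41 and Rsubcategory41, so the boolean filter for subcategory 41
still selects document 1 alone.
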