[Xapian-discuss] Re: searching and sorting by date

James Aylett james-xapian at tartarus.org
Wed Mar 22 22:26:07 GMT 2006


On Wed, Mar 22, 2006 at 12:19:26PM -0800, Michel Pelletier wrote:

> I did exclude it for simplicity.  One question I have about the 
> technique described in the above link, so if I have subcategory=41, that 
> will be encoded as the term SUBCATEGORY41.  All that makes sense, and I 
> have been able to confirm that works.

I'd probably recommend calling it XSUBCATEGORY41 instead. (X as a
first character is reserved for domain-specific use, so no other
"standard" Xapian app will ever generate terms starting with it.)

S is used as the subject prefix - however, a subject term will never
contain any further capital letters after the S. So it's not actually
a problem to generate SUBCATEGORY41 if you want to (providing you
never generate Subcategory41, which would be indistinguishable from
the term for "ubcategory41" in the subject of a document - unlikely
though that may be!).
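
To make the collision concrete, here is a rough sketch of how the
prefixed terms are put together. The function names are made up for
illustration, and lower-casing subject words before prefixing is an
assumption about the indexer, not something Xapian enforces:

```python
SUBJECT_PREFIX = "S"  # the subject prefix discussed above


def subject_term(word):
    # Assumption: subject words are lower-cased before the prefix is added.
    return SUBJECT_PREFIX + word.lower()


def category_term(subcategory_id, prefix="XSUBCATEGORY"):
    # Hypothetical helper: builds the application-specific term.
    return "%s%d" % (prefix, subcategory_id)


# An all-caps term can't be produced by subject_term(), because everything
# after the leading S is lower case there.
print(category_term(41))                        # XSUBCATEGORY41
print(category_term(41, prefix="SUBCATEGORY"))  # SUBCATEGORY41

# A mixed-case term is the one to avoid: it is the same string as the
# subject term for the word "ubcategory41".
print(subject_term("ubcategory41"))             # Subcategory41
```

With the X prefix the question doesn't arise at all, which is why I'd
lean that way.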

> But what if someone who was authoring the document text used the term 
> SUBCATEGORY41, either inadvertently or on purpose?  Would that skew 
> their document into more search results than would normally be the case? 
> Is there a way to prevent that? Do I have to keep my application 
> specific terms and searchable text in two different databases?

Generally you will take upper-cased words and do the following:

 * (stem and) index the lower-cased version
 * index the lower-cased version prefixed by "R"

So if you had "SUBCATEGORY41" in the text, you'd generate two terms:
"subcategory41" and "Rsubcategory41". The former will happily match
"subcategory41" in a query, and the latter will match "SUBCATEGORY41"
in a query (as well as "SubCategory41" etc. etc.).
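
As a rough sketch of that scheme (plain Python, no Xapian calls; the
function name is made up, stemming is left out, and treating any word
containing a capital letter as "upper-cased" is my assumption here):

```python
def body_terms(word):
    """Return the terms generated for one word of document text."""
    lowered = word.lower()
    terms = [lowered]  # the (unstemmed) lower-cased term
    if word != lowered:
        # Assumption: any word with a capital letter also gets an R term.
        terms.append("R" + lowered)
    return terms


terms = body_terms("SUBCATEGORY41")
print(terms)                     # ['subcategory41', 'Rsubcategory41']

# Neither generated term equals the application-specific term, so text
# containing "SUBCATEGORY41" doesn't end up in the category filter.
print("SUBCATEGORY41" in terms)  # False
```

So there's no need for a second database: the document text simply
never produces the all-caps term.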

(You probably already know this, but you'll want to call
add_prefix("subcategory", "SUBCATEGORY") or something on your
QueryParser object...)

> >And I have difficulty with helping you because I don't know how Xapwrap 
> >indexes documents.
> 
> Yes, it is a bit tricky, the API is simple but as Jarrod pointed out 
> there are great benefits to going straight with the swig wrapper.  I'll 
> probably end up taking his advice once I figure this all out. ;)

I had a look at Xapwrap a while ago, and couldn't actually figure out
what it was giving me over the straight bindings. On the other hand, I
implemented the Python bindings and I know the Xapian API pretty well
:-)

If you use the bindings, it's likely that more people on this list
will be able to help. If there's useful stuff we can put into the
bindings to make them more Python-like, we can do that - we already
have the MSet (and similar) behave as a Python iterator, so you can do:

for match in mset:
    # do stuff
    pass

for instance.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


