[Xapian-discuss] Re: searching and sorting by date

Michel Pelletier michel at dialnetwork.com
Wed Mar 22 17:45:02 GMT 2006


Sungsoo Kim wrote:
>> We have a use case where we must return the first 50 most recent 
>> documents that match our query.  We don't want the first 50 matches to 
>> the query that are then sorted by date.  I hope the distinction is 
>> clear enough.  What we are unsure of from reading the documents is if 
>> setting a sort value on our query (enq.set_sort_by_value()) will 
>> return the first 50 documents that match the query, or the first 50 
>> matches, then sorted by that value.
> 
> 
> 
> Xapian will first run the query with the given query terms and then 
> sort by value. It sorts the whole result set, so you need not worry 
> that only the first 50 matches get sorted.

I'm not sure I parse that, so I figured I'd elaborate on my question.

Sungsoo, thank you for your post, and thanks also to James Aylett and 
Richard for their help.  Hopefully I understand the various answers, but 
just in case I'd like to present a use case that is more concrete than 
my original question.

We are changing a legacy application that was built entirely on MySQL. 
We'd like to remove the searching role from MySQL and only use it for 
document storage, and instead use Xapian as the IR system.  From what 
we've researched so far, we think Xapian is well suited for this task.

Here is a simplification of the SQL query that I'm trying to replicate 
with Xapian:

SELECT ad.id FROM ad WHERE ad.subcategory_id=41 ORDER BY last_posted 
DESC LIMIT 0,51;

Is this essentially the same in Xapian as querying for subcategory_id = 
41 and then using last_posted as the sortKey?  And, like this SQL query, 
can Xapian return the 50 most recent documents whose subcategory_id == 41?
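
To make the distinction concrete, here is a small pure-Python sketch. It 
is a toy model, not Xapian code: the `ads` rows and both function names 
are invented for illustration.  It contrasts sorting every match by date 
and then truncating (what the SQL above does) with truncating first and 
sorting only what is left.

```python
# Toy rows: (id, subcategory_id, last_posted). All values are invented.
ads = [
    (1, 41, "2006-03-01"),
    (2, 41, "2006-03-20"),
    (3, 7,  "2006-03-22"),
    (4, 41, "2006-03-15"),
    (5, 41, "2006-03-21"),
]

def newest_matches(rows, subcategory_id, limit):
    """Filter, sort every match by date (newest first), then truncate.

    Mirrors WHERE ... ORDER BY last_posted DESC LIMIT 0,<limit>.
    """
    matches = [r for r in rows if r[1] == subcategory_id]
    matches.sort(key=lambda r: r[2], reverse=True)
    return [r[0] for r in matches[:limit]]

def first_matches_then_sorted(rows, subcategory_id, limit):
    """Take the first `limit` matches in stored order, then sort only those."""
    matches = [r for r in rows if r[1] == subcategory_id][:limit]
    matches.sort(key=lambda r: r[2], reverse=True)
    return [r[0] for r in matches]

print(newest_matches(ads, 41, 2))             # [5, 2]
print(first_matches_then_sorted(ads, 41, 2))  # [2, 1]
```

The first function is the behaviour we need; the second is the behaviour 
we are worried about getting by accident.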

I'm a little mixed up on this, partially because we are using Xapwrap 
which is an early product without much documentation (but still quite 
useful so far).  Perhaps I should ask this question on the divmod lists 
instead.

I also have a side question, if someone has the time.  I've read 
that Xapian documents have postings, data, terms, and values.  I 
understand Xapian's definitions of these four things, but the 
existing documentation gives only an example of querying with terms, and 
we'd like to query by value. In other words, the equivalent of "WHERE 
ad.subcategory_id=41".  The brief query parser docs say:

"""
Searching within a probabilistic field

If the database has been indexed with prefixes on probabilistic terms 
from certain fields, you can set up a prefix map so that the user can 
search within those fields. For example author:dickens title:shop might 
find documents by dickens with shop in the title. You can also specify a 
prefix on a quoted phrase or on a bracketed expression.
"""

But I haven't been able to find any further explanation, in particular 
of what "indexed with prefixes on probabilistic terms from certain 
fields" means.  I tried querying for "subcategory_id:41" but had no 
success.
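
One plausible reading of that passage, sketched as a toy model: the 
indexer stores a field's value as a term with a short prefix string 
glued on the front, and the prefix map tells the query parser which 
prefix a user-visible field name stands for.  The code below does not 
call Xapian at all; `PREFIX_MAP`, the `XSUBCAT` prefix, and the function 
names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical prefix map: user-facing field name -> term prefix.
PREFIX_MAP = {"subcategory_id": "XSUBCAT", "author": "A"}

def index_document(index, doc_id, fields):
    """Store each field value as a prefixed term pointing at doc_id."""
    for name, value in fields.items():
        index[PREFIX_MAP[name] + str(value).lower()].add(doc_id)

def search(index, query):
    """Translate 'field:value' into a prefixed term and look it up."""
    name, _, value = query.partition(":")
    prefix = PREFIX_MAP.get(name)
    if prefix is None:
        return set()  # unknown field: no prefix is mapped to it
    return set(index.get(prefix + value.lower(), set()))

index = defaultdict(set)
index_document(index, 1, {"subcategory_id": 41, "author": "Dickens"})
index_document(index, 2, {"subcategory_id": 7, "author": "Dickens"})
index_document(index, 3, {"subcategory_id": 41, "author": "Austen"})

print(sorted(search(index, "subcategory_id:41")))  # [1, 3]
print(sorted(search(index, "author:dickens")))     # [1, 2]
```

If this model is right, a query like subcategory_id:41 can only match 
when prefixed terms were added at index time and the parser was told 
about the mapping.  In Xapian itself the relevant calls appear to be 
Document.add_term() when indexing and QueryParser.add_prefix() or 
add_boolean_prefix() when searching; whether Xapwrap exposes them is 
something I have not checked.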

Again, thanks for your help!  If anyone can shed light on how to use 
values in queries, I'd be happy to send a patch to the documentation so 
that the answer is written down for others to use.

> 
> When the query terms are blank it will return nothing; that is, the 
> MSet will be empty. Therefore you must supply some query terms. I guess 
> this is your main concern coming from a general RDBMS point of view, 
> because in an RDBMS we can get all rows, or only the first 50, in date 
> order.
> 
> You can add the date to the term list, e.g. 'D20060322', when indexing 
> documents, and at the same time store the date as a value. Then you can 
> search for 'D2006*' or 'D200603*' and call set_sort_by_value() on the 
> date value. If docids are in the same order as the dates, you don't 
> need set_sort_by_value().
> 
> My answer may not be appropriate, and you may get better ideas from 
> others.

Actually, Sungsoo, the technique you describe above is very cool, thanks 
a lot!
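
For anyone following along, here is how I picture that technique, as a 
pure-Python toy model rather than real Xapian calls.  The `docs` data 
and the function names are invented; the date term gives the query 
something to match on, and the stored date value is what gets sorted.

```python
# Toy documents: `terms` are searchable, `date_value` is only sorted on.
docs = {
    1: {"terms": {"D20060322"}, "date_value": "20060322"},
    2: {"terms": {"D20060301"}, "date_value": "20060301"},
    3: {"terms": {"D20051231"}, "date_value": "20051231"},
    4: {"terms": {"D20060315"}, "date_value": "20060315"},
}

def match_wildcard(docs, pattern):
    """Return ids of documents having a term that starts with the pattern stem."""
    stem = pattern.rstrip("*")
    return [doc_id for doc_id, d in docs.items()
            if any(t.startswith(stem) for t in d["terms"])]

def newest_first(docs, doc_ids, limit):
    """Sort all matching ids by stored date value, newest first, then cut."""
    ranked = sorted(doc_ids, key=lambda i: docs[i]["date_value"], reverse=True)
    return ranked[:limit]

march = match_wildcard(docs, "D200603*")
print(newest_first(docs, march, 2))  # [1, 4]
```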

-Michel



