[Xapian-discuss] Sorting and comparing
Olly Betts
olly at survex.com
Tue Sep 13 01:12:57 BST 2005
On Mon, Sep 12, 2005 at 11:13:22PM +0200, Floris Bos wrote:
> I've got a few more things that I can't figure out when I look at the
> documents. In the documentation the SORT cgi parameter is explained. The
> document tells that by default a string comparison is used but I want to
> sort on a date. Can I alter the comparison mode somehow?
No, but if your dates are stored as YYYYMMDD then a string comparison
will give exactly the same results as a numeric one (well, for the next
7994 years or so). Similarly you can pad other numeric values with
leading zeros.
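
As a minimal sketch of the padding idea (the helper name pad_number is made up for illustration and is not part of Xapian or Omega), something like this produces values which sort the same way under string and numeric comparison:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format a non-negative number as a zero-padded string of at least `width`
// characters, so that plain string comparison orders values the same way a
// numeric comparison would.  `width` must be large enough for the biggest
// value you expect to store, otherwise the ordering breaks again.
std::string pad_number(unsigned long value, int width) {
    assert(width > 0 && width < 32);  // keep the padded result inside buf
    char buf[32];
    std::snprintf(buf, sizeof buf, "%0*lu", width, value);
    return std::string(buf);
}
```

With a width of 3, the values 9 and 10 become "009" and "010", which compare in the right order as strings, whereas the unpadded "9" sorts after "10".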
User specified sort ordering is planned but requires a fair bit of work
to implement.
> I'd also like to
> know if it's possible to specify whether to sort ascending or descending.
> I'd like to be able to sort both ways.
The core library supports this as of 0.9.0, but Omega hasn't caught up
yet. Try the attached patch which should apply to 0.9.2 (it compiles
but is untested currently).
> Further I'd like to know if I can specify a field name for the date range
> restriction I want to add to my search query. I've looked at START END and
> SPAN but I don't see how I can use these parameters for a certain field.
> When I just pass: START=20050907&END=20050915 search results from before 7
> september are not returned (as you would expect) but I can't figure out
> which date field is used.
Omega uses terms with prefixes Y, M, and D to build a boolean expression
covering the required range of dates. D terms are for a single date and
are used at the ends of the range. M terms cover whole months, and Y
terms whole years. This scheme allows any range to be expressed using
a boolean expression without involving hundreds of terms. (Currently
omindex also generates W terms which cover a range of about 10 days, but
omega ignores these - we meant to profile whether they actually helped
but never got around to it!)
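
To illustrate the scheme, here is a rough sketch of how a date range can be broken down into Y, M and D terms. This is not Omega's actual code: the Date struct and all the function names are invented for the example, and it uses a simple greedy walk from the start date, taking a whole year or whole month whenever one fits inside the range.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch (not a Xapian or Omega type).
struct Date { int y, m, d; };

static bool is_leap(int y) {
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

static int days_in_month(int y, int m) {
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (m == 2 && is_leap(y)) ? 29 : days[m - 1];
}

static bool before_or_equal(const Date& a, const Date& b) {
    if (a.y != b.y) return a.y < b.y;
    if (a.m != b.m) return a.m < b.m;
    return a.d <= b.d;
}

// Build a term such as Y2005, M200509 or D20050907.
static std::string make_term(char kind, const Date& dt) {
    char buf[48];
    if (kind == 'Y')
        std::snprintf(buf, sizeof buf, "Y%04d", dt.y);
    else if (kind == 'M')
        std::snprintf(buf, sizeof buf, "M%04d%02d", dt.y, dt.m);
    else
        std::snprintf(buf, sizeof buf, "D%04d%02d%02d", dt.y, dt.m, dt.d);
    return std::string(buf);
}

static Date first_of_next_month(const Date& dt) {
    return dt.m == 12 ? Date{dt.y + 1, 1, 1} : Date{dt.y, dt.m + 1, 1};
}

// Return terms which, ORed together, cover start..end inclusive.
std::vector<std::string> date_range_terms(Date start, Date end) {
    assert(before_or_equal(start, end));
    std::vector<std::string> terms;
    Date cur = start;
    while (before_or_equal(cur, end)) {
        Date month_end = Date{cur.y, cur.m, days_in_month(cur.y, cur.m)};
        if (cur.m == 1 && cur.d == 1 &&
            before_or_equal(Date{cur.y, 12, 31}, end)) {
            // The whole of this year lies inside the range.
            terms.push_back(make_term('Y', cur));
            cur = Date{cur.y + 1, 1, 1};
        } else if (cur.d == 1 && before_or_equal(month_end, end)) {
            // The whole of this month lies inside the range.
            terms.push_back(make_term('M', cur));
            cur = first_of_next_month(cur);
        } else {
            // Only part of this month is wanted, so use a single day.
            terms.push_back(make_term('D', cur));
            cur = (cur.d == month_end.d) ? first_of_next_month(cur)
                                         : Date{cur.y, cur.m, cur.d + 1};
        }
    }
    return terms;
}
```

With this sketch, the range 20041230 to 20060202 comes out as D20041230, D20041231, Y2005, M200601, D20060201 and D20060202: six terms rather than one for each of the 400 days.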
If you index with omindex, the date used is the file modification time.
With scriptindex, it's whichever field you index using the "date" command.
> The problem here is that I've got more than one
> date field specified in my documents in the search db and that I want to be
> able to restrict my search using each of those date fields (not at the same
> time though). I tried this:
> STARTdatefieldname=20050907&ENDdatefieldname=20050915 but this doesn't work
> so I figure this is not the right way. So can I specify a date field to
> restrict the search or isn't this possible at all (and maybe the insertion
> date in the db is used for date restriction?).
You'd need to create multiple D, M, and Y terms with different prefixes
(e.g. XDATEFIELDNAMED20050907, XDATEFIELDNAMEM200509, XDATEFIELDNAMEY2005)
in scriptindex and then support generating these in omega. I'll try to
look at doing this for the next release.
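
For illustration only, the three terms for one named date field could be built along these lines. The function name field_date_terms is hypothetical, and the prefix convention (X, then the upper-cased field name, then D, M or Y) simply follows the example terms above; it isn't something omega or scriptindex currently generate for you.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Build the day, month and year terms for a date held in a named field.
// `yyyymmdd` must be an 8 character date such as "20050907".
std::vector<std::string> field_date_terms(const std::string& field,
                                          const std::string& yyyymmdd) {
    assert(yyyymmdd.size() == 8);
    std::string prefix = "X";
    for (unsigned char ch : field)
        prefix += static_cast<char>(std::toupper(ch));
    return {
        prefix + "D" + yyyymmdd,               // single day
        prefix + "M" + yyyymmdd.substr(0, 6),  // whole month
        prefix + "Y" + yyyymmdd.substr(0, 4),  // whole year
    };
}
```

Each document would get one such set of terms per date field, and a range restriction on a given field would then be expressed with that field's prefixed terms instead of the plain Y, M and D ones.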
> Final question: can I use an integer comparison for a certain field? I've
> got a field in my search db that can be a number between 0 and 999. Now I'd
> like to be able to restrict my search (example) to those documents where
> this field has a value between 111 and 379 (and this could be totally
> different numbers the next time). Can this be done somehow and what would
> be the best way to do so? I'm starting to think that this kind of search
> restriction is not really what Xapian/Omega is meant for or am I wrong
> here? I thought so because this is neither a text search, nor a boolean
> search.
There's no explicit support for range searches as such at present, but
there are several ways to achieve the effect:
(1) Use a MatchDecider - you subclass this and provide a method which
says "yes" or "no" given a candidate document for the MSet during the
match:
http://www.xapian.org/docs/sourcedoc/html/classXapian_1_1MatchDecider.html
It's best to use a value to store the field you want to check as that's
much faster to read than the whole document data (and should be faster
still in Flint). You'll have to use a value anyway if you want to allow
sorting on the same quantity.
The main downside of this is that it typically impedes the matcher's
ability to give good estimates of the total number of matches (it has no
idea how many documents a MatchDecider will cull). Users will tend
to notice that the number of page links is unreliable, but you can make
use of the "check_at_least" parameter to get_mset() (which is MINHITS in
Omega) to counteract this.
(2) Generate terms similar to those Omega uses for date searching.
Omega's date searching scheme predates MatchDecider, but has the
advantage of allowing better statistical estimates. Essentially
date ranges aren't any different to other ranges of numbers. The
main downside here is that a lot of terms get created and increase
the size of the database.
(3) Depending on the nature of the range, you may be able to do
work at index time to avoid work at search time. On a movie review
site you might want to allow searching for movies which score
"more than X/10" or "less than Y/10" and it's quite feasible to
add suitable terms to each document. A movie scoring 3/10 would
be indexed by "more than 1/10" and "more than 2/10" and also
"less than 4/10", "less than 5/10", etc.
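
As a sketch of approach (3), the index-time work for the movie example might look like this. The term names XMORETHAN and XLESSTHAN and the function score_terms are invented for the illustration, and I've assumed whole-number thresholds from 1 to 10 to match the example above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Generate the boolean terms to add to a document for a movie with the
// given score out of 10.  A document gets "more than X" for every
// threshold X below its score, and "less than Y" for every threshold Y
// above it.
std::vector<std::string> score_terms(int score) {
    assert(score >= 0 && score <= 10);
    std::vector<std::string> terms;
    for (int x = 1; x < score; ++x)
        terms.push_back("XMORETHAN" + std::to_string(x));
    for (int y = score + 1; y <= 10; ++y)
        terms.push_back("XLESSTHAN" + std::to_string(y));
    return terms;
}
```

A search for movies scoring "more than 2/10" is then just a filter on the single term XMORETHAN2, with no range handling needed at search time.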
> I hope someone can help me out once more. By the way, I'm really satisfied
> with the performance of Omega/Xapian. On the fly inserting of new records
> using script input files works great for me.
Cool!
Cheers,
Olly
-------------- next part --------------
Index: docs/cgiparams.txt
===================================================================
--- docs/cgiparams.txt (revision 6359)
+++ docs/cgiparams.txt (working copy)
@@ -116,12 +116,10 @@
SORT
reorder results by this value number (greater values are better).
- By default the comparison is a string compare.
+ The comparison used is a string compare.
-SORTBANDS
- reorder results by the specified sort key within this many equal
- width bands of percentage relevance. So if this is 5, the bands
- are 100-80, 80-60, 60-40, 40-20, and 20-0.
+SORTREVERSE
+ if non-zero, reverse the sort order so that lower values are better.
Display parameters and navigation
---------------------------------
Index: omega.h
===================================================================
--- omega.h (revision 6359)
+++ omega.h (working copy)
@@ -60,7 +60,7 @@
extern bool sort_numeric;
extern Xapian::valueno sort_key;
-extern int sort_bands;
+extern bool sort_ascending;
extern Xapian::valueno collapse_key;
extern bool collapse;
Index: omega.cc
===================================================================
--- omega.cc (revision 6359)
+++ omega.cc (working copy)
@@ -73,9 +73,9 @@
// percentage cut-off
int threshold = 0;
-bool sort_numeric = true;
-Xapian::valueno sort_key = 0;
-int sort_bands = 0; // Don't sort
+bool sort_numeric = false;
+Xapian::valueno sort_key = Xapian::valueno(-1);
+bool sort_ascending = true;
Xapian::valueno collapse_key = 0;
bool collapse = false;
@@ -317,18 +317,17 @@
if (val != cgi_params.end()) {
const string & v = val->second;
if (v[0] == '#') {
+ // FIXME not supported currently!
sort_numeric = true;
sort_key = atoi(v.c_str() + 1);
} else {
sort_key = atoi(v.c_str());
}
- sort_bands = 1; // sorting is off unless this is set
- val = cgi_params.find("SORTBANDS");
+ val = cgi_params.find("SORTREVERSE");
if (val != cgi_params.end()) {
- sort_bands = atoi(val->second.c_str());
- if (sort_bands <= 0) sort_bands = 1;
+ sort_ascending = (atoi(val->second.c_str()) == 0);
}
- // FIXME: add SORT and SORTBANDS to filters too! But in a compatible
+ // FIXME: add SORT and SORTREVERSE to filters too! But in a compatible
// way ideally...
}
Index: query.cc
===================================================================
--- query.cc (revision 6359)
+++ query.cc (working copy)
@@ -307,8 +307,6 @@
if (enquire && error_msg.empty()) {
enquire->set_cutoff(threshold);
- // match_min_hits will be moved into matcher soon
- // enquire->set_min_hits(min_hits); or similar...
// Temporary bodge to allow experimentation with Xapian::BiasFunctor
MCI i;
@@ -322,16 +320,13 @@
}
enquire->set_bias(bias_weight, half_life);
}
- if (sort_bands) {
- enquire->set_sorting(sort_key, sort_bands);
- // FIXME: ignore sort_numeric for now
+ if (sort_key != Xapian::valueno(-1)) {
+ enquire->set_sort_by_value_then_relevance(sort_key, sort_ascending);
}
if (collapse) {
enquire->set_collapse_key(collapse_key);
}
- // FIXME - set msetcmp to reverse?
-
#ifdef HAVE_GETTIMEOFDAY
struct timeval tv;
if (gettimeofday(&tv, 0) == 0) {