[Xapian-discuss] Searching date range on a custom field

Matt Barnicle mattb at wageslavery.org
Fri Jun 1 03:34:28 BST 2007


Olly Betts wrote:
> On Thu, May 31, 2007 at 03:12:18PM -0700, Matt Barnicle wrote:
>   
>> I'm using htdig to crawl the site, and htdig2omega to create the index.  
>> The index creation and field mapping works just fine, and so does 
>> searching on the boolean page type.  Here is my htdig2omega.script file:
>>
>> url : field=url hash boolean=Q unique=Q
>> title : weight=3 index truncate=80 field=title
>> lastMod : field=lastmod
>> size : field=size
>> sample : index truncate=300 field=sample
>> metaDesc : field=metadesc index
>> pageType : field=pageType boolean=XPT
>> eventName : field=eventName weight=3 index
>> dateBegin : field=dateBegin date=yyyymmdd
>> dateEnd : field=dateEnd date=yyyymmdd
>>     
>
> The scriptindex "date" action is designed to allow you to do date range
> filtering when each document has a single date, so this won't really
> work.
>
> You could make it work if you ran the date action on every date in the
> range, but if your ranges are long, that's going to generate a lot of
> terms.
>   
You know..  That might actually work out.  I'll do some investigation on 
our data and see what is the distribution of number of active days for 
an event...  Though I'm kicking around another solution in my head at 
the moment (see the end of the email)..
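
To get a rough idea of how many terms that would mean per event, here is a
minimal sketch that lists every day between two yyyymmdd strings.  It is
plain C++ with no Xapian calls, so the indexing step itself is left out,
and the helper names (days_in_month, days_in_range) are made up for
illustration.  It assumes both inputs are 8-digit strings.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Number of days in a month (1-12), using Gregorian leap-year rules.
static int days_in_month(int year, int month) {
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    const bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    return (month == 2 && leap) ? 29 : days[month - 1];
}

// Hypothetical helper: every day from begin to end inclusive, as yyyymmdd.
// Assumes both arguments are 8-digit strings; returns an empty list for
// malformed input or when end is before begin.
std::vector<std::string> days_in_range(const std::string &begin,
                                       const std::string &end) {
    std::vector<std::string> out;
    if (begin.size() != 8 || end.size() != 8) return out;
    int y = std::stoi(begin.substr(0, 4));
    int m = std::stoi(begin.substr(4, 2));
    int d = std::stoi(begin.substr(6, 2));
    if (m < 1 || m > 12 || d < 1 || d > days_in_month(y, m)) return out;

    std::string cur = begin;
    // yyyymmdd strings sort in date order, so string comparison is enough.
    while (cur <= end) {
        out.push_back(cur);
        // Step to the next calendar day.
        if (++d > days_in_month(y, m)) {
            d = 1;
            if (++m > 12) { m = 1; ++y; }
        }
        if (y > 9999) break;  // would no longer fit the 8-digit format
        char buf[32];
        std::snprintf(buf, sizeof buf, "%04d%02d%02d", y, m, d);
        cur = buf;
    }
    return out;
}
```

The size of the returned list is the number of date terms one event would
add to the index, which is the figure to check against the real data.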
>> I found some posts from the list archives that discuss date ranges, but 
>> I can't figure out if they will help me in this situation or not..  I 
>> think they're talking about searching on date ranges on indexed 
>> documents, that is, the date the document was indexed.
>>     
>
> Yes, they are.
>
> What I'd suggest you do is to put the dateBegin and dateEnd into
> document values, so you can access them quickly during the match
> process.  For example:
>
>   dateBegin : field=dateBegin value=0
>   dateEnd : field=dateEnd value=1
>
> And then write a little MatchDecider subclass which takes
> a date and checks if a document's date range includes it.  Something
> like this totally untested code:
>
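> #include <string>
> #include <xapian.h>
>
> // Accepts a document only if `date` falls within its stored range.
> class DateRangeMatchDecider : public Xapian::MatchDecider {
>     std::string date;  // yyyymmdd, so string order matches date order
>
>   public:
>     DateRangeMatchDecider(const std::string & date_)
>         : date(date_) { }
>
>     bool operator()(const Xapian::Document &doc) const {
>         // Value 0 holds dateBegin, value 1 holds dateEnd.
>         return doc.get_value(0) <= date && date <= doc.get_value(1);
>     }
> };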
>
> (You might want to swap the order of the checks, depending whether you
> expect user dates are more likely to fall before or after events in
> the database.)
>
> Then you can instantiate this class with the date the user wants to
> search for and pass it to Enquire::get_mset().  You'll also want to
> OP_FILTER with XPTevent to only consider events.
>
> If you want the user to be able to search for any event happening within
> a range of dates, you can easily extend the above class to take a pair
> of dates and check if it overlaps with the document's range.
>
> Cheers,
>     Olly
>   
Hmm..  Well, this looks like a great solution; however, I'm using the PHP 
bindings, and from what I've read, you can't subclass MatchDecider 
in PHP..  So right now I'm thinking of another home-brewed solution to 
this.  Perhaps I could change the way the dates are presented in the 
meta tags, and create boolean terms for each year that the event is 
active, and also for each month, and still have
the full date strings in there also, but just store them as normal 
fields.  So the meta tags look like this:

<meta name="activeYear" content="2007" />
<meta name="activeMonth" content="02" />
<meta name="activeMonth" content="03" />
<meta name="dateBegin" content="20070217" />
<meta name="dateEnd" content="20070310" />
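
The activeYear and activeMonth values could be derived from dateBegin and
dateEnd rather than maintained by hand.  A minimal sketch, assuming
yyyymmdd strings with four-digit years; the names ActiveTerms and
active_terms are hypothetical, and the code is C++ only to match the
snippet earlier in this thread:

```cpp
#include <set>
#include <string>

// Distinct years and months touched by an event's date range.
struct ActiveTerms {
    std::set<std::string> years;   // e.g. "2007"
    std::set<std::string> months;  // e.g. "02", "03"
};

// Hypothetical helper: walk month by month from begin to end (both
// yyyymmdd) and collect the year and month each step falls in.
// Returns empty sets for malformed input or when end is before begin.
ActiveTerms active_terms(const std::string &begin, const std::string &end) {
    ActiveTerms t;
    if (begin.size() != 8 || end.size() != 8 || end < begin) return t;
    int y = std::stoi(begin.substr(0, 4));
    int m = std::stoi(begin.substr(4, 2));
    const int ey = std::stoi(end.substr(0, 4));
    const int em = std::stoi(end.substr(4, 2));
    if (m < 1 || m > 12 || em < 1 || em > 12) return t;

    while (y < ey || (y == ey && m <= em)) {
        t.years.insert(std::to_string(y));
        // Zero-pad the month to two digits.
        t.months.insert((m < 10 ? "0" : "") + std::to_string(m));
        if (++m > 12) { m = 1; ++y; }
    }
    return t;
}
```

For the example above (20070217 to 20070310) this gives year 2007 and
months 02 and 03.  Because the year and month terms are independent, an
event running from December 2007 into January 2008 would also pass a
filter for December 2008, so a day-level check on dateBegin and dateEnd
is still needed afterwards.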

Then I could convert the active* tags to boolean terms, and create an 
OP_FILTER for the month and year of the user-selected date (e.g. 20070225) 
as XAY=2007, XAM=02.  Then I could run through all the results and read the 
dateBegin and dateEnd fields, doing a simple comparison in PHP to 
determine if the record matches the day range, and output only those 
records that do.  There aren't *that* many events that are running in a 
month's time, looks like 400 or so from April 2007.  That's still a fair 
amount of data to process on every date search request, I suppose..  
I'll consider it some more.  But, would that work?
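
The second pass would just be a string comparison per record.  A minimal
sketch of that step, written in C++ to match the snippet earlier in this
thread (the same comparisons should carry over to PHP); EventRecord,
active_on, overlaps and keep_active_on are hypothetical names, and the
records are assumed to have already been read out of the result set:

```cpp
#include <string>
#include <vector>

// One search result, with the stored dateBegin/dateEnd fields (yyyymmdd).
struct EventRecord {
    std::string name;
    std::string date_begin;
    std::string date_end;
};

// True if the event is running on the given day.  yyyymmdd strings sort
// in date order, so plain string comparison works.
bool active_on(const EventRecord &e, const std::string &day) {
    return e.date_begin <= day && day <= e.date_end;
}

// True if the event's range shares at least one day with [from, to].
bool overlaps(const EventRecord &e, const std::string &from,
              const std::string &to) {
    return e.date_begin <= to && from <= e.date_end;
}

// Keep only the candidates that are running on the given day.
std::vector<EventRecord> keep_active_on(
        const std::vector<EventRecord> &candidates, const std::string &day) {
    std::vector<EventRecord> kept;
    for (const EventRecord &e : candidates) {
        if (active_on(e, day)) kept.push_back(e);
    }
    return kept;
}
```

With a few hundred candidates per month this is one linear pass with two
string comparisons per record.  The overlaps() variant covers the case of
searching for events within a range of dates instead of on a single day.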

I guess my other option would be to patch the source and add the 
subclass to it, but then I would have to work with SWIG to add that as 
an available PHP class..  And that falls a little bit out of my comfort 
zone at the moment.

- Matt


