[Xapian-discuss] QueryParser stemming

Tim Brody tdb01r at ecs.soton.ac.uk
Mon Jun 13 16:41:49 BST 2005


Olly Betts wrote:
> On Thu, Jun 09, 2005 at 11:57:54AM +0100, Tim Brody wrote:
> 
>>I'm considering expanding Xapian to cover all of my search fields:
>>authors => A
>>title => T [stemmed]
>>description => D [stemmed]
>>date => Y [range?]
>>(fulltext => F) [stemmed]
>>
>>I would like to allow users to specify a query e.g.
>>Brody impact analysis 2004
>>
>>If the user isn't explicit with prefixing I need to be able to modify the 
>>query terms (e.g. 'Brody' is an author name) to apply stemming and 
>>prefixing as appropriate.
> 
> I don't follow how you know 'Brody' is meant to be an author name.
> Assuming all capitalised words are author names seems likely to
> frustrate anyone who doesn't read and memorise the help.

The first query anybody gives a citation index is their own name - I 
want to get that search right (e.g. if someone enters 'Hawking' I want 
to first list papers by Stephen Hawking, then all papers that contain 
'hawk' as a term). It's not difficult to maintain an author name 
vocabulary to pre-fetch from.
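
A rough sketch of what I mean, in plain Perl with no Xapian calls. The 
%is_author hash, toy_stem() and expand_query() are made-up names for 
illustration only; toy_stem() is a placeholder for a real stemmer, and 
ranking the author matches ahead of the rest is not covered here:

```perl
use strict;
use warnings;

# Hypothetical author vocabulary; in practice it would be pre-fetched
# from wherever the author names are maintained.
my %is_author = map { $_ => 1 } qw(brody hawking);

# Placeholder stemmer (assumption): only strips a trailing "ing".
sub toy_stem {
    my ($word) = @_;
    $word =~ s/ing$//;
    return $word;
}

# Expand unprefixed query words into prefixed index terms.
sub expand_query {
    my (@words) = @_;
    my @terms;
    for my $word (map { lc } @words) {
        # Known author names get the A prefix and are left unstemmed.
        push @terms, "A$word" if $is_author{$word};
        # Title (T) and description (D) terms are stemmed.
        my $stem = toy_stem($word);
        push @terms, map { "$_$stem" } qw(T D);
    }
    return @terms;
}

print join(' ', expand_query('Hawking')), "\n";   # prints "Ahawking Thawk Dhawk"
```

The resulting terms would then be combined into a query, with the A 
terms weighted or listed first.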

>>I don't think I can achieve this with the current Perl bindings. To do 
>>title OR description OR fulltext I need to iterate over the terms and 
>>add the appropriate prefix for each field. Similarly I will want to stem 
>>title/description terms, but leave author terms alone.
>>
>>So, is this feasible? Is there a better approach?
> 
> For searching over all fields, you can do the work at index time instead
> of search time (with the exception of the non-stemming), which is likely
> to give a faster search.  I'd probably recommend that approach.
> 
> So for the author, title, and description fields, you generate both the
> prefixed terms, and non-prefixed ones.  Except you need to stem the
> non-prefixed author terms then.  I don't see an easy way to avoid that.
> 
> As for not wanting the same stemming strategy for all fields,
> QueryParser::add_prefix() should probably take a stem_strategy argument
> which overrides the main setting.

I think this is the only way to achieve what I want (from Perl anyway). 
An alternative would be to call Stem with the current prefix which would 
provide complete flexibility.

>>Shall I start adding the Internals to Perl's bindings?
> 
> The interfaces to the Internals classes are subject to arbitrary change
> without notice.  It doesn't make sense to try to wrap them.
> 
> Anyway, the binding layer is the wrong place to add this in my view.  We
> don't really want to add generic functionality there - that belongs in
> the core library where it's accessible to all users.  Wrapping things in
> a way more natural to the language is fine - for example lazy lists
> instead of iterators.  That's inherently language specific.

It would be useful to be able to manipulate a query after it's been 
built by the QP. A simple thing to expose might be the serialisation - 
stored queries and all that!

>>(And what happened to my patches? :-)
> 
> 
> I'm working through them.  There are some changes which are good but I
> want to generalise.  So far I've made == and != work the same as 'eq'
> and 'ne' on all iterators, not just TermIterator - that's all applied
> and committed.  Also, being able to use Perl lists which wrap iterators
> should be available everywhere really.  I've stalled a little on that
> because we're really going to want lazy evaluation for some cases (e.g.
> Database::allterms) and I need to read up on how that's done.

You can overload '<>' in a new class that holds the begin and end 
iterators.

This would allow:

while(defined(my $term = <$it>)) {
}

Overloading the array dereference '@{}' in the same class would provide 
list access (the complete termlist would be read into memory):

for(@$it) {
}

Mixing the two access methods on the same object would result in missing 
terms, because both advance the same underlying iterator.

Tying an array can't be implemented efficiently with iterators, and 
would need a class per iterator use, versus one class per iterator for 
the approach above.
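
A minimal sketch of such a wrapper class, to show the idea rather than 
a finished implementation. MockIterator is a made-up stand-in for a 
real Xapian iterator (its value/inc/equal methods are assumptions, not 
the bindings' API), and IteratorRange is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Xapian iterator, backed by a plain array.
package MockIterator;
sub new {
    my ($class, $terms, $pos) = @_;
    return bless { terms => $terms, pos => $pos }, $class;
}
sub value { my ($self) = @_; return $self->{terms}[ $self->{pos} ] }
sub inc   { my ($self) = @_; $self->{pos}++; return $self }
sub equal { my ($self, $other) = @_; return $self->{pos} == $other->{pos} }

# Wrapper holding the begin and end iterators.
package IteratorRange;
use overload
    '<>'  => \&next_term,   # lazy access: <$it>
    '@{}' => \&as_array;    # list access: @$it

sub new {
    my ($class, $begin, $end) = @_;
    return bless { begin => $begin, end => $end }, $class;
}

# Return the current term and advance, or undef once the end is reached.
sub next_term {
    my ($self) = @_;
    return undef if $self->{begin}->equal($self->{end});
    my $term = $self->{begin}->value;
    $self->{begin}->inc;
    return $term;
}

# Drain the remaining terms into an array reference (all held in memory).
sub as_array {
    my ($self) = @_;
    my @terms;
    while (defined(my $term = $self->next_term)) {
        push @terms, $term;
    }
    return \@terms;
}

package main;

my @terms = qw(analysis brody impact);
my $it = IteratorRange->new(
    MockIterator->new(\@terms, 0),
    MockIterator->new(\@terms, scalar @terms),
);

my $first = <$it>;   # consumes 'analysis'
my @rest  = @$it;    # only 'brody' and 'impact' are left
print "first=$first rest=@rest\n";
```

The last three lines show the mixing problem: the list access only sees 
the terms that the '<>' access has not already consumed.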

All the best,
Tim.


