[Xapian-discuss] QueryParser stemming

Olly Betts olly at survex.com
Mon Jun 13 19:12:02 BST 2005


On Mon, Jun 13, 2005 at 04:41:49PM +0100, Tim Brody wrote:
> The first query anybody gives a citation index is their own name - I 
> want to get that search right (e.g. if someone enters 'Hawking' I want 
> to first list papers by Stephen Hawking, then all papers that contain 
> 'hawk' as a term). It's not difficult to maintain an author name 
> vocabulary to pre-fetch from.

OK, so when you see "Brody" you look in your author name list and see
it's an author's surname and think "author:Brody" instead?  Or perhaps
"(author:Brody OR Brody)", since names like "Black" could equally well
be part of a title.
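
A minimal sketch of that rewrite, assuming a small in-memory list of
known author surnames.  The function name `expand_author_terms`, the
literal "author:" prefix string, and the sample names are illustrative
assumptions, not part of Xapian or its QueryParser:

```python
def expand_author_terms(query, author_surnames):
    """Rewrite each word found in the author list as "(author:Word OR Word)".

    Words that are not known surnames are passed through unchanged.
    """
    known = {name.lower() for name in author_surnames}
    rewritten = []
    for word in query.split():
        if word.lower() in known:
            # Could be a surname or an ordinary word (e.g. "Black"), so keep both.
            rewritten.append(f"(author:{word} OR {word})")
        else:
            rewritten.append(word)
    return " ".join(rewritten)


if __name__ == "__main__":
    print(expand_author_terms("Black holes", ["Hawking", "Black", "Brody"]))
```

The resulting string would then be handed to the query parser as usual.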

If you use the "R" prefix for raw terms like Omega does, then "Brody",
"Hawking", or "Black" searches for the unstemmed form anyway.  Hmm,
though a search for "hawk" would match a document by "hawking", or
alternatively (if you didn't generate stemmed author terms) a search for
"hawking" wouldn't match a document by "Hawking".
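
To make the two cases concrete, here is a toy model.  It assumes a
stand-in stemmer that just strips a trailing "ing" (a real stemmer does
far more), and a simplified rule that capitalised query words map to
R-prefixed raw terms.  All function names are hypothetical:

```python
def toy_stem(word):
    """Stand-in stemmer (assumption): lowercase and strip a trailing 'ing'."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def author_terms(name, stem_authors):
    """Terms indexed for an author name: always the raw R-prefixed form,
    plus the stemmed form if stem_authors is true."""
    terms = {"R" + name.lower()}
    if stem_authors:
        terms.add(toy_stem(name))
    return terms


def query_term(word):
    """Simplified rule: capitalised words search the raw term, others are stemmed."""
    if word[:1].isupper():
        return "R" + word.lower()
    return toy_stem(word)


if __name__ == "__main__":
    for stem_authors in (True, False):
        indexed = author_terms("Hawking", stem_authors)
        for q in ("Hawking", "hawking", "hawk"):
            print(stem_authors, q, query_term(q) in indexed)
```

With stemmed author terms, "hawk" matches the author; without them,
lower-case "hawking" does not.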

> >As for not wanting the same stemming strategy for all fields,
> >QueryParser::add_prefix() should probably take a stem_strategy argument
> >which overrides the main setting.
> 
> I think this is the only way to achieve what I want (from Perl anyway). 
> An alternative would be to call Stem with the current prefix which would 
> provide complete flexibility.

But then the stemmer would need to be configurable by prefix too.

> It would be useful to be able to manipulate a query after it's been 
> built by the QP. A simple thing to expose might be the serialisation - 
> stored queries and all that!

I can see a use for being able to serialise a Query and unserialise it
again later.

But there's a big difference between serialising to an opaque blob of
data and serialising to something which can sanely be manipulated.  The
current serialisation used internally is designed to be fairly compact,
while still simple and efficient to pack and unpack.

And a Query object is currently immutable once constructed, so adding
methods to manipulate it would require a copy-on-write mechanism to
be implemented (or copy-on-copy which is easier to implement but
expensive).
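
For illustration only, copy-on-write on a handle class might look
roughly like this.  `ToyQuery` and `_Shared` are hypothetical names, the
explicit share count stands in for real reference counting, and none of
this reflects how Xapian's Query is actually implemented:

```python
class _Shared:
    """Internal data shared between handles, with an explicit share count."""

    def __init__(self, terms):
        self.terms = list(terms)
        self.refs = 1


class ToyQuery:
    """Hypothetical handle: copies share internals until one is modified."""

    def __init__(self, terms=()):
        self._data = _Shared(terms)

    def copy(self):
        # Cheap copy: share the internals and bump the share count.
        other = ToyQuery.__new__(ToyQuery)
        other._data = self._data
        self._data.refs += 1
        return other

    def add_term(self, term):
        # Copy-on-write: detach from shared internals before modifying.
        if self._data.refs > 1:
            self._data.refs -= 1
            self._data = _Shared(self._data.terms)
        self._data.terms.append(term)

    def terms(self):
        return tuple(self._data.terms)
```

Copy-on-copy would instead duplicate the internals inside copy(), which
needs no share count but pays the cost on every copy.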

Perhaps a better approach would be to have a callback-type mechanism
in the QueryParser: call a virtual method which can manipulate a
freshly tokenised term prior to it being added into the Query?
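
Roughly what I have in mind, as a sketch.  `ToyQueryParser`, its
`process_term` hook, and the whitespace tokenising are all hypothetical
stand-ins; the real QueryParser has no such method at present:

```python
class ToyQueryParser:
    """Hypothetical parser with an overridable per-term hook."""

    def process_term(self, term, prefix):
        """Hook: return the term to add, or None to drop it.

        The default implementation leaves the term unchanged.
        """
        return term

    def parse(self, text):
        terms = []
        for token in text.split():
            prefix, sep, word = token.partition(":")
            if not sep:
                # No "prefix:" part, so the whole token is the word.
                prefix, word = "", token
            result = self.process_term(word, prefix)
            if result is not None:
                terms.append(f"{prefix}:{result}" if prefix else result)
        return terms


class AuthorAwareParser(ToyQueryParser):
    def process_term(self, term, prefix):
        # Leave author names untouched; lowercase everything else.
        return term if prefix == "author" else term.lower()


if __name__ == "__main__":
    print(AuthorAwareParser().parse("author:Hawking Black Holes"))
```

A subclass could just as well stem per prefix in the hook, which would
cover the per-field stemming case too.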

> Can overload '<>' in a new class that contains the begin and end.
> 
> This would allow:
> 
> while(defined(my $term = <$it>)) {
> }

Ah yes, this is a better match for what the STL calls an input_iterator.

And it also would allow use of skip_to to start iterating from a
particular point.
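
The shape of such a wrapper, sketched in Python rather than Perl for
brevity.  `TermCursor` is a hypothetical class over an in-memory sorted
list; a real wrapper would hold the begin and end iterators instead:

```python
from bisect import bisect_left


class TermCursor:
    """Hypothetical single-pass cursor over a sorted list of terms."""

    def __init__(self, sorted_terms):
        self._terms = list(sorted_terms)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._terms):
            raise StopIteration
        term = self._terms[self._pos]
        self._pos += 1
        return term

    def skip_to(self, term):
        """Advance (never rewind) to the first remaining term >= term."""
        self._pos = bisect_left(self._terms, term, self._pos)


if __name__ == "__main__":
    it = TermCursor(["alpha", "hawk", "hawking", "zeta"])
    it.skip_to("hawk")
    for term in it:
        print(term)
```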

> Array overloading '@{}' the same class would provide list-access 
> (complete termlist would go into memory):

Neat.

Cheers,
    Olly


