[Xapian-discuss] Quickest way to retrieve data for a large match set?

Fri Jun 25 13:51:04 BST 2010

On Thu, Jun 24, 2010 at 12:55:09PM +0100, William Crawford wrote:
> We're using the Perl binding to access Xapian in a simple search of image 
> metadata (title and keywords). Due to the specification for the search engine, 
> by default we have to sort the results using a function of the search rank, 
> age (well, newness) and popularity (rated by sales of the image). As a result, 
> we have to fetch the complete result set and then calculate a new ranking 
> based on the original rank, perturbed using the ratios of each of the newness 
> and popularity to the highest values in the result set (i.e. there is no way 
> to precalculate these at indexing time, alas).

It would be more efficient to tell Xapian about the contributions from age and
popularity so it can produce the ranking you actually want.

You can do this in Xapian 1.2 by subclassing Xapian::PostingSource, which
allows extra weight contributions to be added dynamically during the match,
but this isn't yet wrapped for use from Perl.
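
To illustrate the idea (not Xapian's API), here is a minimal sketch in plain
Python of what "adding an extra weight contribution during the match" buys
you: once each document's extra weight is part of its score, only the best k
results need to be kept, rather than the whole match set. The function name,
the boost table and the sample numbers are all made up for the example.

```python
import heapq


def top_k_with_extra_weight(text_weights, extra_weight, k):
    """Return the k best (docid, combined_weight) pairs, best first.

    text_weights: iterable of (docid, weight) pairs from text matching.
    extra_weight: callable mapping docid -> extra weight to add
                  (hypothetical stand-in for an age/popularity score).
    """
    # Combine lazily so the full scored list is never materialised.
    combined = ((docid, w + extra_weight(docid)) for docid, w in text_weights)
    # nlargest keeps only k candidates in memory at any time.
    return heapq.nlargest(k, combined, key=lambda pair: pair[1])


# Hypothetical data: three matching documents and a per-docid boost.
text_weights = [(1, 2.0), (2, 1.5), (3, 0.5)]
popularity_boost = {1: 0.0, 2: 1.0, 3: 0.25}

best = top_k_with_extra_weight(
    text_weights, lambda docid: popularity_boost.get(docid, 0.0), k=2
)
```

Document 2 overtakes document 1 here because its boost outweighs the gap in
text weight. This only works if the extra weight can be computed per document
without first seeing the whole result set.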

> Currently fetching the document data for the results has become something of a 
> bottleneck (typical searches may generate 50 - 500 matches, but some return 
> more than 5000).
> 
> Code is something like:
> 
> ...
>     print STDERR "Query = ", $q->get_description, "\n" if $self->debug;
>     my $e = $self->index->enquire ($q);
>     #my $hits = $e->get_mset(0, $self->index->get_doccount, $self->index->get_doccount);
>     my (@hits) = $e->matches (0, $self->index->get_doccount, $self->index->get_doccount);
>     my (@results) = map +thaw($_->get_document->get_data), @hits;
>     return \@results;
> }
> 
> I'd like to know if there's anything I can do to improve the speed of fetching 
> the results (in other words, am I doing it wrong)?

I wouldn't recommend this approach since by insisting on Xapian finding
all the matches, you're hampering the optimisations it can use.  That's
why we added Xapian::PostingSource - it allows you to perform the
equivalent of many post-processing tricks during the match.

But right now, that doesn't help if you want to use Perl.

If you're just putting an external numeric id in the document data, you
could use that as the document id for Xapian instead, which would avoid
the need to call get_document() and get_data() for every match.  Or
if you are looking up the popularity in an external database, then you
could key that lookup on Xapian's docid.
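
As a rough sketch of the second suggestion, the re-ranking can work from
(docid, weight) pairs alone, with newness and popularity held in tables keyed
on docid, so no document data is fetched per match. This is plain Python and
uses no Xapian calls. The scoring formula, the 0.5 mixing factors and the
sample values are assumptions for illustration only, since the actual ranking
function isn't given in the thread.

```python
def rerank(hits, newness, popularity, w_new=0.5, w_pop=0.5):
    """Re-rank hits using ratios to the highest values in the result set.

    hits:       list of (docid, weight) pairs read from the match set.
    newness:    dict docid -> newness value (hypothetical external table).
    popularity: dict docid -> sales count (hypothetical external table).
    w_new, w_pop: assumed mixing factors, not taken from the original post.
    """
    if not hits:
        return []
    docids = [docid for docid, _ in hits]
    # Highest values within this result set; fall back to 1.0 to avoid
    # dividing by zero when every value is zero or missing.
    max_new = max(newness.get(d, 0.0) for d in docids) or 1.0
    max_pop = max(popularity.get(d, 0.0) for d in docids) or 1.0

    scored = []
    for docid, weight in hits:
        boost = (1.0
                 + w_new * newness.get(docid, 0.0) / max_new
                 + w_pop * popularity.get(docid, 0.0) / max_pop)
        scored.append((docid, weight * boost))
    # Best combined score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical result set and lookup tables keyed on docid.
hits = [(10, 4.0), (11, 3.0), (12, 2.0)]
newness = {10: 1.0, 11: 4.0, 12: 2.0}
popularity = {10: 0, 11: 50, 12: 100}

ranked = rerank(hits, newness, popularity)
```

With these numbers document 11 moves ahead of document 10. The lookups are
dictionary accesses on the docid, which is the only per-match data needed
besides the weight.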

Cheers,
    Olly