[Xapian-discuss] Quickest way to retrieve data for a large match set?

Fri Jun 25 14:58:41 BST 2010

On Friday 25 June 2010 13:51:04 Olly Betts wrote:

> It would be more efficient to tell Xapian about the contributions from age
> and popularity so it can produce the ranking you actually want.
> 
> You can do this in Xapian 1.2 by subclassing Xapian::PostingSource which
> allows extra weight contributions to be added in dynamically, but this
> isn't yet wrapped for use from Perl.

The requirement is to use the ratio of each result's age relative to *the 
newest one in the result set* rather than in the whole database, hence we 
can't actually determine the values until we have the whole set. This is the 
problem, sorry.
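
To make the dependency on the full result set concrete, here is a rough sketch (in Python rather than Perl, purely for illustration, and not our actual code). The function name, the `age_weight` parameter and the exact formula are all assumptions; the only point is that `newest` cannot be known until every match has been seen.

```python
def rerank_by_relative_age(matches, age_weight=0.5):
    """Re-rank (relevance, timestamp, data) tuples by age relative to the set.

    Hypothetical formula: freshness is 1.0 for the newest match in this
    result set and 0.0 for the oldest, and boosts relevance multiplicatively.
    """
    if not matches:
        return []
    # Both bounds depend on the whole result set, not on the database.
    newest = max(ts for _, ts, _ in matches)
    oldest = min(ts for _, ts, _ in matches)
    span = (newest - oldest) or 1.0  # avoid dividing by zero if all ages match
    reranked = []
    for relevance, ts, data in matches:
        freshness = 1.0 - (newest - ts) / span
        reranked.append((relevance * (1.0 + age_weight * freshness), data))
    # Highest adjusted score first.
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked
```

Timestamps here are plain numbers (e.g. epoch seconds); adding or removing a single match can change `newest` and therefore every other score.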

I have proposed relaxing this slightly, since using OP_SCALE_WEIGHT along with 
a weighted term in the database would allow something almost the same as what 
we're doing; however if there are many results with little separation between 
relevance values then the tweaks for age / popularity swamp the contribution 
from the search criteria (somewhat subjective, I guess, but still, I'm not the 
one signing this off ;) ).

> I wouldn't recommend this approach since by insisting on Xapian finding
> all the matches, you're hampering the optimisations it can use.  That's
> why we added Xapian::PostingSource - it allows you to perform the
> equivalent of many post-processing tricks during the match.

I appreciate it looks like a dumb thing to do (although it may allow us to 
cache search results for paging through, which ultimately saves time if users 
look at more than a couple of pages in the result set). On the other hand we 
need an accurate page count to generate the navigation links, so I'd be 
searching for the complete result set even if we didn't fetch the document 
data for all of them.
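
As a minimal sketch of the paging side (again Python for illustration; `page_count`, `page_slice` and the page size of 10 are assumed names and values, not anything from our code), the navigation links need only the total number of matches, while each page is a slice of the cached list:

```python
import math

def page_count(total_matches, page_size=10):
    """Number of pages needed to show every match (0 when there are none)."""
    return math.ceil(total_matches / page_size)

def page_slice(cached_results, page, page_size=10):
    """Return the results for a 1-based page number from a cached list."""
    start = (page - 1) * page_size
    return cached_results[start:start + page_size]
```

The count has to be exact for the links to be right, which is why an estimate of the match total is not enough here.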

> But right now, that doesn't help if you want to use Perl.
> 
> If you're just putting an external numeric id in the document data, you
> could use that as the document id for Xapian instead, which would avoid
> the need to call get_document() and get_data() for every match.  Or
> if you are looking up the popularity in an external database, then you
> could key that lookup on Xapian's docid.

Well, we're currently storing it in the document data, as part of a frozen 
Perl hash. What I don't need is the document IDs - just the data (which 
includes our catalogue number, as that's used in the results page, and so we 
can actually avoid any database queries when we display the results).

I was hoping there might be a fast way of just getting the data :)

> Cheers,
>     Olly