[Xapian-discuss] Quickest way to retrieve data for a large match set?
William Crawford
william at sciencephoto.co.uk
Fri Jun 25 14:58:41 BST 2010
On Friday 25 June 2010 13:51:04 Olly Betts wrote:
> It would be more efficient to tell Xapian about the contributions from age
> and popularity so it can produce the ranking you actually want.
>
> You can do this in Xapian 1.2 by subclassing Xapian::PostingSource which
> allows extra weight contributions to be added in dynamically, but this
> isn't yet wrapped for use from Perl.
The requirement is to use the ratio of each result's age relative to *the
newest one in the result set* rather than in the whole database, hence we
can't actually determine the values until we have the whole set. This is the
problem, sorry.
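
A minimal sketch of why this needs the whole set, in Python rather than Perl
purely for illustration. The `Match` fields, the `age_factor` parameter and
the exact scoring formula are all assumptions, not what we actually run; the
point is only that `newest` is unknown until every match has been seen.

```python
from dataclasses import dataclass


@dataclass
class Match:
    catalogue_no: str  # hypothetical field name
    weight: float      # relevance weight reported by the search engine
    added: float       # hypothetical timestamp, seconds since the epoch


def rerank_by_relative_age(matches, age_factor=0.5):
    """Re-rank matches, boosting those closest in age to the newest match.

    Assumed formula: freshness is 1.0 for the newest match in this result
    set and 0.0 for the oldest, and scales the relevance weight by up to
    (1 + age_factor).
    """
    if not matches:
        return []
    # Both bounds come from the result set itself, not the whole database,
    # so they cannot be computed before the full set is available.
    newest = max(m.added for m in matches)
    oldest = min(m.added for m in matches)
    span = newest - oldest

    def score(m):
        freshness = 1.0 if span == 0 else 1.0 - (newest - m.added) / span
        return m.weight * (1.0 + age_factor * freshness)

    return sorted(matches, key=score, reverse=True)
```

Adding or removing a single match can change `newest` and therefore the score
of every other match, which is what prevents doing this incrementally.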
I have proposed relaxing this slightly, since using OP_SCALE_WEIGHT along with
a weighted term in the database would allow something almost the same as what
we're doing; however, if there are many results with little separation between
relevance values then the tweaks for age / popularity swamp the contribution
from the search criteria (somewhat subjective, I guess, but still, I'm not the
one signing this off ;) ).
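
A toy illustration of the swamping effect, again in Python. It assumes a
simple additive model (relevance plus a scaled boost) with made-up numbers;
it is not a reproduction of how Xapian computes weights, only of the
behaviour described above.

```python
def rank(results, boost_scale):
    """Order (name, relevance, boost) tuples by relevance + scaled boost."""
    ordered = sorted(
        results,
        key=lambda r: r[1] + boost_scale * r[2],
        reverse=True,
    )
    return [name for name, _relevance, _boost in ordered]


# Relevance values close together: the boost decides the order.
close = [("a", 10.2, 0.0), ("b", 10.1, 0.5), ("c", 10.0, 1.0)]

# Relevance values well separated: the same boost barely matters.
spread = [("a", 30.0, 0.0), ("b", 20.0, 0.5), ("c", 10.0, 1.0)]

print(rank(close, 1.0))
print(rank(spread, 1.0))
```

With the closely spaced relevance values the order is reversed by the boost,
while the well-separated set keeps its relevance order.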
> I wouldn't recommend this approach since by insisting on Xapian finding
> all the matches, you're hampering the optimisations it can use. That's
> why we added Xapian::PostingSource - it allows you to perform the
> equivalent of many post-processing tricks during the match.
I appreciate it looks like a dumb thing to do (although it may allow us to
cache search results for paging through, which ultimately saves time if users
look at more than a couple of pages of the result set). On the other hand, we
need an accurate page count to generate the navigation links, so I'd be
searching for the complete result set even if we didn't fetch the document
data for all of them.
> But right now, that doesn't help if you want to use Perl.
>
> If you're just putting an external numeric id in the document data, you
> could use that as the document id for Xapian instead, which would avoid
> the need to call get_document() and get_data() for every match. Or
> if you are looking up the popularity in an external database, then you
> could key that lookup on Xapian's docid.
Well, we're currently storing it in the document data, as part of a frozen
Perl hash. What I don't need is the document IDs - just the data (which
includes our catalogue number, as that's used in the results page, and so we
can actually avoid any database queries when we display the results).
I was hoping there might be a fast way of just getting the data :)
> Cheers,
> Olly