[Xapian-discuss] Need more explanations about Xapian's expanding

Tue Sep 30 06:13:43 BST 2008

On Fri, Sep 26, 2008 at 06:11:40PM +0200, Ivan Sutter wrote:
> First, I get the top 40 results :
> $matches = $enquire->get_mset(0, 40, $rset);
> That works well, even if 40 is often enough (I mean I often get fewer than
> 40 results).
> 
> Then, I call this part (sorry for the stupid copy-paste) :
> // If no relevant docids were given, invent an RSet containing the top 5
> // matches (or all the matches if there are less than 5).
>    if ($rset->is_empty()) {
>     $c = 20; // so here I've put 20 instead of 5...
>     $i = $matches->begin();
>     while ($c-- && !$i->equals($matches->end())) {
>         $rset->add_document($i->get_docid());
>         $i->next();
>     }
>    }
> And in fact that's weird because my $rset is empty, yet it's passed to the
> previous get_mset() call! I must have missed something.
> 
> Finally, I'm getting the suggestions :
> $eset = $enquire->get_eset(10, $rset);
> 
> As you can see, I haven't mastered all these lines ... I'd just like some
> help understanding how these "ratios" (the 40, 20 and 5) affect the result.

Well, "40" is just the size of the MSet you've requested.

"20" is how many documents from the MSet you are adding to the RSet, and
"5" is how many the example you started from added.

I suspect 20 is too many - you want the RSet to contain genuinely
relevant documents.  Ideally the user would pick the relevant documents,
but you can often get reasonable results by assuming that the top few
entries from the MSet are relevant.  The more you add, though, the more
likely it is that some won't actually be relevant, especially if you
are often getting fewer than 40 results in total.

You could probably look at how the MSet weights vary to pick a cut-off
dynamically.  I've not done tests, but it seems likely that you don't
want to keep adding documents once the weights drop sharply.
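
As a rough, untested sketch of that idea (in Python rather than PHP, so
that it runs without Xapian installed): the function below takes the MSet
weights in rank order and stops at the first sharp drop.  The name
pick_rset_size and the max_docs and drop_ratio defaults are made up for
illustration and are not part of Xapian; you'd need to tune them against
your own data.

```python
def pick_rset_size(weights, max_docs=5, drop_ratio=0.5):
    """Return how many top-ranked documents to put in the RSet.

    weights    -- MSet weights in rank order, highest first
    max_docs   -- hard upper limit on the RSet size (hypothetical default)
    drop_ratio -- stop once a weight falls below this fraction of the
                  previous document's weight (hypothetical, untested)
    """
    count = 0
    previous = None
    for weight in weights:
        if count >= max_docs:
            break  # never add more than max_docs documents
        if previous is not None and weight < previous * drop_ratio:
            break  # sharp drop: this and later documents are left out
        count += 1
        previous = weight
    return count


if __name__ == "__main__":
    # Example weights invented for illustration only.
    print(pick_rset_size([9.0, 8.5, 8.1, 3.0, 2.9]))
```

In your PHP loop the weights would come from $i->get_weight() on each MSet
iterator, and the returned count would take the place of the fixed $c.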

I wonder if you meant "10" not "5"?  "10" is the number of terms you'd
like in the ESet.

> Don't worry, I've run tests, but given the amount of data, it's hard
> to know if I've found a truly good result or if it's just luck!
> So a "scientific" explanation would be great!

I'm not sure "science" can automatically give you good values for the
number of documents to add to the auto-generated RSet and the number of
relevant terms to ask for.  You probably do want to run some tests to
empirically validate the numbers you're using.

Cheers,
    Olly