[Xapian-discuss] How many docs to feed to an RSet?

Matthew Somerville matthew at mysociety.org
Wed Mar 5 10:55:30 GMT 2008


Olly Betts wrote:
> We need to be able to see how this is going wrong.  It's annoyingly
> fiddly to handle this case.
> 
> Is the database small enough to give us a copy to prod?

It's currently 8.1GB on disc, I'm afraid. I could build a smaller one with 
only some of the entries in it and see if I can provoke the same behaviour, 
I guess. Or give someone a login on the server :)

> Otherwise, it would be interesting to see the bounds - print out
> get_matches_upper_bound() and get_matches_lower_bound() as well as
> get_matches_estimated() for each N.

Okay, see below. I believe I have worked something out, or at least spotted 
something...

>> As I increase the first argument to get_mset() it eventually starts
>> returning the right result.
> 
> Did you mean third rather than first here?

No, I did mean the first. The result gets worse as I increase the third 
argument, eventually reaching the 510 I quoted, but if I ask for results 
starting from e.g. result 400 (so a larger first argument), the numbers are 
accurate.

Below is a table of results I'm currently seeing for a query for the word 
"elephant" (stemmed, so the Query is Zeleph) with different arguments to 
get_mset().

-----
*Added after researching the table below*

Something seems to "flip" when checkatleast (the third argument to 
get_mset()) goes above a certain number. I called get_mset(0, 20, N) for 
every N between 0 and 1000 - the correct number of results, as noted below, 
is 463. As you would expect, as N increased, the results got more accurate, 
with the lower bound rising and the upper bound falling, until the following 
happened (estimate/lower/upper):
     get_mset(0, 20, 500): 465 455 467
     get_mset(0, 20, 501): 463 455 465
     get_mset(0, 20, 502): 463 456 464
     get_mset(0, 20, 503): 463 457 464
     get_mset(0, 20, 504): 463 458 464
     get_mset(0, 20, 505): 463 459 464
     get_mset(0, 20, 506): 464 460 464
     get_mset(0, 20, 507): 464 461 464
     get_mset(0, 20, 508): 464 462 464
     get_mset(0, 20, 509): 463 462 463
     get_mset(0, 20, 510): 463 463 463 <-- first time it gets it perfect
     get_mset(0, 20, 511): 510 510 510 <-- goes wrong
     get_mset(0, 20, 512): 510 510 510
     get_mset(0, 20, 513): 510 510 510
     get_mset(0, 20, 514): 510 510 510
     get_mset(0, 20, 515): 510 510 510
(and every N bigger than that returned 510/510/510).
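
The flip can be located without knowing the right answer in advance: each 
lower/upper pair claims to bracket the true count, so the ranges from two 
different values of N must share at least one value. A minimal sketch in 
Python, using only the numbers quoted above (the names RUNS and 
first_contradiction are made up for illustration, not part of Xapian):

```python
# (N, estimate, lower, upper) copied from the get_mset(0, 20, N) runs above.
RUNS = [
    (500, 465, 455, 467),
    (501, 463, 455, 465),
    (502, 463, 456, 464),
    (503, 463, 457, 464),
    (504, 463, 458, 464),
    (505, 463, 459, 464),
    (506, 464, 460, 464),
    (507, 464, 461, 464),
    (508, 464, 462, 464),
    (509, 463, 462, 463),
    (510, 463, 463, 463),
    (511, 510, 510, 510),
    (512, 510, 510, 510),
    (513, 510, 510, 510),
    (514, 510, 510, 510),
    (515, 510, 510, 510),
]


def first_contradiction(runs):
    """Return (N_prev, N_cur) for the first adjacent pair of runs whose
    [lower, upper] ranges are disjoint, or None if all adjacent pairs overlap.

    Both ranges are supposed to contain the true match count, so two
    ranges with no value in common cannot both be right.
    """
    for prev, cur in zip(runs, runs[1:]):
        _, _, prev_lo, prev_hi = prev
        _, _, cur_lo, cur_hi = cur
        if cur_lo > prev_hi or cur_hi < prev_lo:
            return prev[0], cur[0]
    return None


print(first_contradiction(RUNS))  # prints (510, 511)
```

This only shows that the bounds for N=510 and N=511 contradict each other; 
it takes the known count of 463 to say which of the two is the wrong one.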

-----

Notes on table: The second argument to get_mset() is always 20. The correct 
number of results with set_collapse_key(3) called is 463. If 
set_collapse_key(3) is *not* called, every single estimated()/lower()/upper() 
in the table below is instead 594, the right result, i.e. this only appears 
to go wrong when set_collapse_key(3) has been called.

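The "Sort by value" column gives N if set_sort_by_value(N) has been called, 
and "-" if it has not. estimated(), lower() and upper() are short for 
get_matches_estimated(), get_matches_lower_bound() and 
get_matches_upper_bound().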
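
For anyone wanting to reproduce these numbers, a loop along the following 
lines would gather the same columns. This is a sketch in Python, not the 
script actually used: collect_bounds and format_rows are hypothetical 
helpers, and `enquire` is assumed to be a Xapian Enquire object that the 
caller has already configured (query set, set_collapse_key(3) called, and 
any sort order applied). Only get_mset() and the three get_matches_*() 
methods of the returned MSet are called on it.

```python
def collect_bounds(enquire, first_values, check_values, maxitems=20):
    """Return rows of (first, checkatleast, estimated, lower, upper).

    `enquire` is assumed to be an already configured Enquire object;
    this function only varies the first and third get_mset() arguments.
    """
    rows = []
    for first in first_values:
        for check in check_values:
            mset = enquire.get_mset(first, maxitems, check)
            rows.append((
                first,
                check,
                mset.get_matches_estimated(),
                mset.get_matches_lower_bound(),
                mset.get_matches_upper_bound(),
            ))
    return rows


def format_rows(rows):
    """Format each row as one fixed-width text line."""
    return ["%5d %5d %5d %5d %5d" % row for row in rows]
```

Calling collect_bounds(enquire, [0, 50, 100, 200, 300, 400], 
[0, 100, 300, 500, 1000]) would cover the same combinations of first and 
third argument as the unsorted half of the table.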

   Sort      get_mset    get_mset
by value    1st arg     3rd arg    estimated()  lower()  upper()
----------------------------------------------------------------
     -           0          0          432          85     562
     -           0         100         453         125     555
     -           0         300         461         283     512
     -           0         500         465         455     467
     -           0         1000        510         510     510

     -          50          0          445         197     528
     -          50         100         449         203     528
     -          50         300         459         316     501
     -          50         500         492         492     492 <-- worse than
                                                                   when 1st
                                                                   arg = 0 ?
     -          50         1000        492         492     492

     -         100          0          457         289     507
     -         100         100         457         289     507
     -         100         300         467         355     497
     -         100         500         481         481     481
     -         100         1000        481         481     481

     -         200          0          461         396     479
     -         200         100         461         396     479
     -         200         300         462         405     478
     -         200         500         472         472     472
     -         200         1000        472         472     472

     -         300          0          464         449     467
     -         300         100         464         449     467
     -         300         300         464         449     467
     -         300         500         465         465     465
     -         300         1000        465         465     465

     -         400          0          463         463     463 <-- first time
     -         400         100         463         463     463     it gets it
     -         400         300         463         463     463     right at
     -         400         500         463         463     463     all
     -         400         1000        463         463     463

     0           0          0          455         108     561
     0           0         100         455         108     561
     0           0         300         462         281     513
     0           0         500         465         455     467
     0           0         1000        510         510     510

     0          50          0          455         108     561
     0          50         100         455         108     561
     0          50         300         462         281     513
     0          50         500         465         455     467
     0          50         1000        510         510     510

     0         100          0          462         425     572
     0         100         100         462         425     572
     0         100         300         462         425     472
     0         100         500         470         470     470
     0         100         1000        470         470     470

     0         200          0          461         443     466
     0         200         100         461         443     466
     0         200         300         461         443     466
     0         200         500         466         466     466
     0         200         1000        466         466     466

     0         300          0          463         452     466
     0         300         100         463         452     466
     0         300         300         463         452     466
     0         300         500         466         466     466
     0         300         1000        466         466     466

     0         400          0          463         459     464
     0         400         100         463         459     464
     0         400         300         463         459     464
     0         400         500         464         464     464
     0         400         1000        464         464     464

     0         460          0          463         463     463
     0         460         100         463         463     463
     0         460         300         463         463     463
     0         460         500         463         463     463
     0         460         1000        463         463     463
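
To separate the rows that are merely imprecise from the ones that are 
actually wrong, one can check whether the known count of 463 still fits 
between lower() and upper(). A small sketch in Python over a sample of the 
rows above (ROWS and impossible_rows are illustrative names only):

```python
TRUE_COUNT = 463  # correct count with set_collapse_key(3), from the notes

# (sort, first, checkatleast, estimated, lower, upper): a sample of the
# table rows, copied as-is.
ROWS = [
    ("-", 0, 500, 465, 455, 467),
    ("-", 0, 1000, 510, 510, 510),
    ("-", 50, 500, 492, 492, 492),
    ("-", 300, 500, 465, 465, 465),
    ("-", 400, 0, 463, 463, 463),
    ("0", 100, 500, 470, 470, 470),
    ("0", 460, 0, 463, 463, 463),
]


def impossible_rows(rows, true_count=TRUE_COUNT):
    """Return the rows whose [lower, upper] range excludes true_count."""
    return [row for row in rows if not (row[4] <= true_count <= row[5])]


for row in impossible_rows(ROWS):
    print(row)
```

A row such as (-, 0, 500) with bounds 455..467 is loose but still 
consistent with 463; a row where lower() and upper() agree on some other 
number rules the right answer out entirely.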


ATB,
Matthew


