[Xapian-discuss] How many docs to feed to an RSet?
Matthew Somerville
matthew at mysociety.org
Wed Mar 5 10:55:30 GMT 2008
Olly Betts wrote:
> We need to be able to see how this is going wrong. It's annoyingly
> fiddly to handle this case.
>
> Is the database small enough to give us a copy to prod?
Currently 8.1GB on disc, I'm afraid. I could create a new one with only some
entries in it and see if I can provoke the same behaviour, I guess. Or give
someone a login to the server :)
> Otherwise, it would be interesting to see the bounds - print out
> get_matches_upper_bound() and get_matches_lower_bound() as well as
> get_matches_estimated() for each N.
Okay, see below. I believe I have worked something out, or at least spotted
something...
>> As I increase the first argument to get_mset() it eventually starts
>> returning the right result.
>
> Did you mean third rather than first here?
No, the result gets worse as I increase the third argument, until it reaches
the 510 I quoted. But if I ask for results starting from e.g. result 400 (so
a larger first argument), the numbers are accurate.
Below is a table of results I'm currently seeing for a query for the word
"elephant" (stemmed, so the Query is Zeleph) with different arguments to
get_mset().
-----
*Added after researching the below table*
Something seems to "flip" when checkatleast goes above a certain number. I
called get_mset(0, 20, N) for every N between 0 and 1000 - the correct
number of results, as below, is 463. As you would expect, as N increased,
the results started getting more accurate, with the lower bound rising and
the upper bound falling, until the following happened (estimate/lower/upper):
get_mset(0, 20, 500): 465 455 467
get_mset(0, 20, 501): 463 455 465
get_mset(0, 20, 502): 463 456 464
get_mset(0, 20, 503): 463 457 464
get_mset(0, 20, 504): 463 458 464
get_mset(0, 20, 505): 463 459 464
get_mset(0, 20, 506): 464 460 464
get_mset(0, 20, 507): 464 461 464
get_mset(0, 20, 508): 464 462 464
get_mset(0, 20, 509): 463 462 463
get_mset(0, 20, 510): 463 463 463 <-- first time it gets it perfect
get_mset(0, 20, 511): 510 510 510 <-- goes wrong
get_mset(0, 20, 512): 510 510 510
get_mset(0, 20, 513): 510 510 510
get_mset(0, 20, 514): 510 510 510
get_mset(0, 20, 515): 510 510 510
(and every N bigger than that returned 510/510/510).
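
A sweep like the one above could be scripted roughly as follows. This is only a sketch: it assumes an already-configured Enquire object from the Python bindings (query and collapse key set beforehand), and the helper names sweep_checkatleast and format_row are made up for illustration. Only get_mset() and the three get_matches_*() calls come from the API discussed in this thread.

```python
def sweep_checkatleast(enquire, first=0, maxitems=20,
                       checkatleast_values=range(0, 1001)):
    """Call get_mset(first, maxitems, N) for each N and collect the counts.

    `enquire` is assumed to be a configured Enquire object (hypothetical
    setup, done by the caller). Returns a list of tuples:
    (N, estimated, lower bound, upper bound).
    """
    rows = []
    for n in checkatleast_values:
        mset = enquire.get_mset(first, maxitems, n)
        rows.append((n,
                     mset.get_matches_estimated(),
                     mset.get_matches_lower_bound(),
                     mset.get_matches_upper_bound()))
    return rows


def format_row(row, first=0, maxitems=20):
    """Format one result in the estimate/lower/upper layout used above."""
    n, estimated, lower, upper = row
    return "get_mset(%d, %d, %d): %d %d %d" % (
        first, maxitems, n, estimated, lower, upper)
```

Printing format_row(row) for each returned row gives lines in the same layout as the list above.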
-----
Notes on table: The second argument to get_mset() is always 20. The correct
number of results with set_collapse_key(3) called is 463. If
set_collapse_key(3) is *not* called, every single estimated()/lower()/
upper() in the table below is instead 594, the right result, i.e. this only
appears to go wrong when set_collapse_key(3) has been called.
The "Sort by value" column gives N if set_sort_by_value(N) has been called,
and "-" otherwise.
Sort      get_mset  get_mset
by value  1st arg   3rd arg   estimated()  lower()  upper()
----------------------------------------------------------------
-         0         0         432          85       562
-         0         100       453          125      555
-         0         300       461          283      512
-         0         500       465          455      467
-         0         1000      510          510      510
-         50        0         445          197      528
-         50        100       449          203      528
-         50        300       459          316      501
-         50        500       492          492      492   <-- worse than when 1st arg = 0 ?
-         50        1000      492          492      492
-         100       0         457          289      507
-         100       100       457          289      507
-         100       300       467          355      497
-         100       500       481          481      481
-         100       1000      481          481      481
-         200       0         461          396      479
-         200       100       461          396      479
-         200       300       462          405      478
-         200       500       472          472      472
-         200       1000      472          472      472
-         300       0         464          449      467
-         300       100       464          449      467
-         300       300       464          449      467
-         300       500       465          465      465
-         300       1000      465          465      465
-         400       0         463          463      463   <-- first time it gets it right at all
-         400       100       463          463      463
-         400       300       463          463      463
-         400       500       463          463      463
-         400       1000      463          463      463
0         0         0         455          108      561
0         0         100       455          108      561
0         0         300       462          281      513
0         0         500       465          455      467
0         0         1000      510          510      510
0         50        0         455          108      561
0         50        100       455          108      561
0         50        300       462          281      513
0         50        500       465          455      467
0         50        1000      510          510      510
0         100       0         462          425      572
0         100       100       462          425      572
0         100       300       462          425      472
0         100       500       470          470      470
0         100       1000      470          470      470
0         200       0         461          443      466
0         200       100       461          443      466
0         200       300       461          443      466
0         200       500       466          466      466
0         200       1000      466          466      466
0         300       0         463          452      466
0         300       100       463          452      466
0         300       300       463          452      466
0         300       500       466          466      466
0         300       1000      466          466      466
0         400       0         463          459      464
0         400       100       463          459      464
0         400       300       463          459      464
0         400       500       464          464      464
0         400       1000      464          464      464
0         460       0         463          463      463
0         460       100       463          463      463
0         460       300       463          463      463
0         460       500       463          463      463
0         460       1000      463          463      463
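
One way to pick out the suspicious rows is to check whether the known correct count (463) lies inside the reported [lower, upper] range. The sketch below does that for a handful of rows copied from the table. The Row type and the excludes_true_count helper are names invented for this example, not part of Xapian.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    sort_value: Optional[int]  # None where the table shows "-"
    first: int                 # 1st argument to get_mset()
    checkatleast: int          # 3rd argument to get_mset()
    estimated: int
    lower: int
    upper: int


def excludes_true_count(row, true_count):
    """True if the reported bounds rule out the known correct count."""
    return not (row.lower <= true_count <= row.upper)


TRUE_COUNT = 463  # correct result with set_collapse_key(3), per the notes

# A few rows copied from the table (not the full set).
ROWS = [
    Row(None, 0, 500, 465, 455, 467),
    Row(None, 0, 1000, 510, 510, 510),
    Row(None, 50, 500, 492, 492, 492),
    Row(None, 400, 0, 463, 463, 463),
    Row(0, 100, 500, 470, 470, 470),
    Row(0, 460, 0, 463, 463, 463),
]

bad_rows = [r for r in ROWS if excludes_true_count(r, TRUE_COUNT)]

for r in bad_rows:
    print("first=%d checkatleast=%d claims %d..%d, true count is %d"
          % (r.first, r.checkatleast, r.lower, r.upper, TRUE_COUNT))
```

In this sample, the rows flagged are exactly those where lower and upper have converged on a value other than 463, i.e. where the "exact" answer is wrong rather than merely loose.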
ATB,
Matthew