[Xapian-discuss] problem with multi-database search, xapian 0.9.10

Vasiliy Sergeev vasiliy.sergeev at sibers.com
Mon Feb 18 08:19:09 GMT 2008


Hi Olly,
> If you make use of more than half the docid space in each of the two
> subdatabases, there's not much we can do.  We need to map the docids
> from the subdatabases to/from the docids of the combined database.  So
> the "fix" would be to throw an exception in this case, which isn't going
> to help you much...
>
> I assume you don't actually have 4 billion documents in each database?
> If you do, then your only option is to recompile Xapian with a 64-bit
> Xapian::docid type.
>   
You assumed correctly: I do not have 4 billion documents. I also found
something strange: all the other monthly databases start with docid=1,
but these problematic databases start with an id like 4227584824. I
detected this problem with the following commands:
delve news200802 -r 1
Error: Can't read termlist for document 1: Not found
delve news200802 -r 1000
Error: Can't read termlist for document 1000: Not found
delve news200802 -r 1000000
Error: Can't read termlist for document 1000000: Not found
delve news200802 -r 1000000000
Error: Can't read termlist for document 1000000000: Not found
But
delve news200802 -r 4227584824
Term List for record #4227584824: here is a list of terms....
It seems to me that Xapian decided to start the January and February DBs
from some value very close to MAX_INT. Is there any way to shift them?
Can the Xapian utilities do this?
Or maybe there is a way for me to set the first docid to 1. In that case,
will Xapian increment the docid from this manually set first docid?
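
To illustrate why a first docid like 4227584824 breaks a search over two
databases, here is a minimal sketch of an interleaved docid mapping. The
interleaving formula, the 32-bit limit, and the function names are
assumptions made for illustration; they are consistent with the "more
than half the docid space" remark quoted above but are not taken from
the Xapian source.

```python
# Hypothetical model of how a combined database could number documents
# from its subdatabases. Not Xapian's actual code.
DOCID_MAX = 2**32 - 1  # assumes Xapian::docid is a 32-bit unsigned integer


def combined_docid(sub_docid, shard_index, n_shards):
    """Map a docid in subdatabase `shard_index` (0-based) to a combined docid."""
    return (sub_docid - 1) * n_shards + shard_index + 1


def split_docid(docid, n_shards):
    """Inverse mapping: combined docid -> (sub_docid, shard_index)."""
    return (docid - 1) // n_shards + 1, (docid - 1) % n_shards


def fits_32_bits(sub_docid, shard_index, n_shards):
    """True if the combined docid is still representable in 32 bits."""
    return combined_docid(sub_docid, shard_index, n_shards) <= DOCID_MAX


if __name__ == "__main__":
    # Small docids round-trip cleanly through the mapping.
    print(split_docid(combined_docid(1000, 1, 2), 2))
    # The docid seen in news200802 does not fit once two databases are combined.
    print(fits_32_bits(4227584824, 1, 2))
```

Under these assumptions, with two subdatabases the largest subdatabase
docid that still maps into 32 bits is 2147483647, so a database whose
docids start near 4.2 billion cannot be combined without overflow.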
> Although you can set your own docids to create a sparse usage pattern,
> it's probably not a good idea to.  The backend uses delta encoding on
> docids to compress posting lists, which means that the compression won't
> be as good.  You'll probably waste more space than you save by not
> storing the UID as a term, and less compressed posting lists affect all
> searches whereas the UID terms won't have much overhead at search time
> (since they'll all appear adjacently in the Btree).
>
>   
Could you please explain this part? I have to use non-compact databases
since they are updated every time news with the same content appears
(a duplicate news item), so updates happen as well as inserts. My
problem with these docids is in the Xapian sorting functions: they
ignore these documents, so it looks as if we have no news for February,
only for January (if we search both months together).
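
For reference, the delta-encoding point in the quoted text can be
sketched roughly as follows. This uses a generic 7-bits-per-byte
variable-length integer, which is an assumption chosen for illustration;
the real on-disk encoding used by the Xapian backend may differ in
detail. The function names are hypothetical.

```python
# Rough model of a delta-encoded posting list: each docid is stored as
# the gap from the previous docid, using a variable-length integer.

def varint_len(n):
    """Bytes needed to store n with 7 payload bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def posting_list_bytes(docids):
    """Total size when every docid is stored as a gap from its predecessor."""
    total, prev = 0, 0
    for docid in sorted(docids):
        total += varint_len(docid - prev)
        prev = docid
    return total


if __name__ == "__main__":
    dense = range(1, 1001)                  # 1000 consecutive docids
    sparse = range(1000, 1_000_001, 1000)   # 1000 docids, 1000 apart
    print(posting_list_bytes(dense), posting_list_bytes(sparse))
```

In this model the same number of postings takes twice as many bytes when
the docids are spread out, which is the kind of overhead the quoted
paragraph warns about for sparse, manually assigned docids.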

Thanks for your help,
Vaso.


