[Xapian-discuss] stub-file and get_doccount
Olly Betts
olly at survex.com
Thu Mar 12 22:44:39 GMT 2015
On Wed, Mar 11, 2015 at 07:01:48PM +0100, QE :: Felix Ostmann wrote:
> i switched from one big index to a stub file with many indexes and running
> into a problem.
>
> i have a tool to fetch a random document via:
>
> get_doccount
> random id up to get_doccount
> get_document with that id
>
> after changing to stub file this fails. Is there a nice way to get a
> random document from a stub file?
Note that even with a single database, the above only works if you've
never deleted any documents.
With multiple databases, the document ids are interleaved - see here for
details of how:
http://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID
This is done so that the numbering is stable when documents are
added to the individual databases.
So unless all the databases have equal numbers of documents (or some
have one fewer and they are arranged suitably), you'll end up with gaps
in the numbering at the upper end.
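To make the gaps concrete, here is a small sketch in plain Python (no Xapian needed). It assumes the interleaving scheme is the one described on the FAQ page: with N subdatabases, docid d in the subdatabase at 0-based position i becomes (d - 1) * N + i + 1 in the combined database. The function names and the example sizes are made up for illustration.

```python
def combined_docid(sub_docid, shard, n_shards):
    """Map a docid inside one subdatabase to its docid in the combined database.

    Assumes the interleaving described in the FAQ; shard is the 0-based
    position of the subdatabase in the stub file.
    """
    return (sub_docid - 1) * n_shards + shard + 1


def combined_lastdocid(sub_lastdocids):
    """Highest combined docid in use, given each subdatabase's last docid."""
    n = len(sub_lastdocids)
    return max(
        (combined_docid(last, shard, n)
         for shard, last in enumerate(sub_lastdocids) if last > 0),
        default=0,
    )


# Hypothetical sizes: three subdatabases with 5, 2 and 1 documents, no deletions.
sizes = [5, 2, 1]
used = sorted(
    combined_docid(d, shard, len(sizes))
    for shard, size in enumerate(sizes)
    for d in range(1, size + 1)
)
print(used)                       # [1, 2, 3, 4, 5, 7, 10, 13]
print(combined_lastdocid(sizes))  # 13
```

Here the document count is 8, but ids 6 and 8 are unused while 10 and 13 are in use, so "random id up to get_doccount" both hits gaps and can never return some documents.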
One option is to pick a random id up to get_lastdocid(), and retry if
DocNotFoundError is thrown. That may be inefficient if get_lastdocid()
is much larger than get_doccount().
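A rough sketch of that retry loop in Python follows. It is written against any object with get_lastdocid() and get_document(), and takes the "not found" exception class as a parameter so the sketch doesn't depend on the bindings being installed. The function name and the max_tries cap are my own additions, not part of Xapian.

```python
import random


def random_document(db, not_found_error, rng=random, max_tries=1000):
    """Pick a random docid up to get_lastdocid(), retrying when it hits a gap.

    Returns None if the database is empty or no document was found within
    max_tries attempts (max_tries is an illustrative safety cap).
    """
    last = db.get_lastdocid()
    if last == 0:
        return None
    for _ in range(max_tries):
        docid = rng.randint(1, last)  # inclusive on both ends
        try:
            return db.get_document(docid)
        except not_found_error:
            continue  # unused docid, try another one
    return None
```

With the Python bindings this would be called along the lines of random_document(xapian.Database("stubfile"), xapian.DocNotFoundError), where "stubfile" stands in for your stub file's path.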
To avoid the exceptions, I think you'll need to pick a subdatabase and
then a document within that. If you aren't fussy about how even the
random distribution is, you could pick 1 out of N subdatabases at
random, and then randomly pick a docid within that subdatabase.
Otherwise you'll want to pick the subdatabases with probability
proportional to the number of documents they contain.
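The weighted choice could look something like the sketch below, using only the Python standard library. It assumes you have opened each subdatabase separately and collected its get_doccount() into a list. pick_subdatabase is a hypothetical helper name.

```python
import random


def pick_subdatabase(doccounts, rng=random):
    """Return a 0-based subdatabase index, chosen with probability
    proportional to the number of documents it contains."""
    if sum(doccounts) == 0:
        raise ValueError("all subdatabases are empty")
    return rng.choices(range(len(doccounts)), weights=doccounts, k=1)[0]


# Hypothetical counts: the second subdatabase should be picked about 90% of the time.
print(pick_subdatabase([100, 900]))
```

For the "not fussy" variant, rng.randrange(len(doccounts)) picks 1 out of N uniformly instead. Either way, picking a docid within the chosen subdatabase still needs a retry if documents have ever been deleted from it.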
Cheers,
Olly