[Xapian-discuss] stub-file and get_doccount

QE :: Felix Ostmann ostmann at qe.de
Fri Mar 13 18:09:26 GMT 2015


OK, after a short brainstorm I implemented the following:

I don't modify my indexes; I only build new ones.

While generating, I save each database together with its doccount (which
equals its lastdocid, since nothing is ever deleted) in an array in the
metadata.
I also save the total doccount across all databases.

Now I can pick a random integer from 1 up to the total doccount and iterate
over the array: if the random integer is greater than the doccount of the
current database, I subtract that doccount from it and move on to the next
database.

Once the random integer is equal to or smaller than the doccount of the
current database, I can open this database and call get_document with the
random integer.
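
The walk over the array described above can be sketched in Python as follows. This is only an illustration, not the code actually in use: the names `locate` and `pick_document` are hypothetical, the list of doccounts stands in for the array stored in the metadata, and it assumes every database has docids 1..doccount with no gaps.

```python
import random


def locate(n, doccounts):
    """Map a 1-based position n in the combined collection to
    (database index, docid inside that database).

    doccounts holds one doccount per database, in a fixed order.
    Assumes docids in each database run from 1 to its doccount.
    """
    for index, count in enumerate(doccounts):
        if n <= count:
            return index, n
        n -= count  # skip this database and keep walking
    raise ValueError("n is larger than the total doccount")


def pick_document(doccounts, rng=random):
    """Pick (database index, docid) uniformly over all documents."""
    total = sum(doccounts)
    if total == 0:
        raise ValueError("no documents in any database")
    return locate(rng.randint(1, total), doccounts)
```

Because the random integer is drawn over the total doccount, each database is chosen with probability proportional to its size, so every document is equally likely. The returned index would then select which database to open, and the returned docid would be passed to get_document on that database.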

Perfect random for me!

Thanks for your help!



Kind regards
Felix Ostmann

-----------------------------------------------------------
QE GmbH & Co. KG, Martinistraße 3, D-49080 Osnabrück
-----------------------------------------------------------
Tel.: +49 (0) 541 / 40666 0, Fax: +49 (0) 541 / 40666 22
Email: info at qe.de, Web: www.qe.de
-----------------------------------------------------------
AG Osnabrück - HRA 200252, Ust-IdNr.: DE814737310
-----------------------------------------------------------
Komplementärin: QE24 GmbH, AG Osnabrück - HRB 200359,
Geschäftsführer: Ansas Meyer, Firmensitz: Osnabrück
-----------------------------------------------------------


2015-03-12 23:44 GMT+01:00 Olly Betts <olly at survex.com>:

> On Wed, Mar 11, 2015 at 07:01:48PM +0100, QE :: Felix Ostmann wrote:
> > I switched from one big index to a stub file with many indexes and am
> > running into a problem.
> >
> > I have a tool to fetch a random document via:
> >
> > get_doccount
> > random id up to get_doccount
> > get_document with that id
> >
> > After changing to a stub file this fails. Is there a nice way to get a
> > random document from a stub file?
>
> Note that the above only works with a single database if you've never
> deleted any documents.
>
> With multiple databases, the document ids are interleaved - see here for
> details of how:
>
> http://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID
>
> This is done so that the numbering is stable when documents are
> added to the individual databases.
>
> So unless all the databases have equal numbers of documents (or some
> have one fewer and they are arranged suitably), you'll end up with gaps
> in the numbering at the upper end.
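
To make the gaps concrete, here is a small sketch of the interleaving as I understand it from the FAQ page linked above: with N subdatabases, docid d in subdatabase i (counting from 0) appears as (d - 1) * N + i + 1 in the combined database. The function names are hypothetical and nothing here calls Xapian itself.

```python
def combined_docid(sub_docid, shard, n_shards):
    """Docid in the combined database for a docid in one subdatabase.

    shard is the 0-based position of the subdatabase in the stub file.
    """
    return (sub_docid - 1) * n_shards + shard + 1


def split_docid(docid, n_shards):
    """Inverse mapping: combined docid -> (shard, docid in that shard)."""
    shard = (docid - 1) % n_shards
    sub_docid = (docid - 1) // n_shards + 1
    return shard, sub_docid


def used_docids(doccounts):
    """Combined docids in use, assuming each subdatabase has docids
    1..doccount with no deletions."""
    n = len(doccounts)
    return sorted(
        combined_docid(d, shard, n)
        for shard, count in enumerate(doccounts)
        for d in range(1, count + 1)
    )
```

For two subdatabases holding 3 and 1 documents, the combined docids in use are 1, 2, 3 and 5: the doccount is 4 but the last docid is 5, and docid 4 does not exist.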
>
> One option is to pick a random id up to get_lastdocid(), and retry if
> DocNotFoundError is thrown.  That may be inefficient if get_lastdocid()
> is much larger than get_doccount().
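
A minimal sketch of that retry approach, written against any object that offers get_lastdocid() and get_document() as Xapian's Database does. The function name, the `max_tries` cap and the way the exception class is passed in are assumptions made for this example; with the Python bindings one would pass `xapian.DocNotFoundError`.

```python
import random


def random_document(db, not_found_error, rng=random, max_tries=1000):
    """Pick random docids up to get_lastdocid() until one exists.

    not_found_error is the exception class raised for a missing docid.
    max_tries is an arbitrary safety cap, not something Xapian requires.
    """
    last = db.get_lastdocid()
    if last == 0:
        raise ValueError("database is empty")
    for _ in range(max_tries):
        docid = rng.randint(1, last)
        try:
            return db.get_document(docid)
        except not_found_error:
            continue  # hit a gap in the numbering, try another docid
    raise RuntimeError("no document found after %d tries" % max_tries)
```

On average this needs about get_lastdocid() / get_doccount() attempts per document, which is why it gets expensive when the numbering is sparse.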
>
> To avoid the exceptions, I think you'll need to pick a subdatabase and
> then a document within that.  If you aren't fussy about how even the
> random distribution is, you could pick 1 out of N subdatabases at
> random, and then randomly pick a docid within that subdatabase.
> Otherwise you'll want to pick the subdatabases with probability
> proportional to the number of documents they contain.
>
> Cheers,
>     Olly
>

