[Xapian-discuss] Xapian and quartz scalability - feedback of current users

Tue Mar 22 11:52:37 GMT 2005

On Sun, Mar 20, 2005 at 12:42:42PM +0100, Alexandre Dulaunoy wrote:
> We would like to make some test with Xapian and the quartz backend on
> a large set of sample test document (around 50 millions for starting)

I'm very interested to hear reports of such tests.  I've done some
myself, but there's a danger that tuning which helps one situation
hinders others.

You should be aware that I'm in the process of overhauling quartz.

My plan is to clone the quartz backend once we've moved to SVN (which
should be in the next week or two), then replace parts of it.  So
quartz won't be destabilised, and the new database format can be fluid
initially without annoying people trying to actually use Xapian!

I've done much of the design now, though most is on paper or in my
head - I need to type it up so others can take a look.  A few things
are already implemented (e.g. there's a patch for zlib compression)
and I've already folded some simple compatible changes into quartz in
CVS (so in 0.9.0 databases will be more compact both before and after
quartzcompact).

But this actually means benchmarking quartz would be very useful at this
point.  It gives a baseline, and we can then track how things change
(hopefully for the better) as development progresses.

> quartz backend seems very flexible and the document scalability on
> the web site (http://www.xapian.org/docs/scalability.html) is talking
> of a possible way to implement a kind of cluster for concurrent search
> indexing and asynchronous updating. We were wondering if there is any
> users of quartz using a clustering approach in the list.

Webtop used a system sort of like this, but sadly the source isn't
open.  They actually used the muscat36 backend (it was either pre-quartz
or quartz was still rather experimental - I don't recall which).  But
the system would look pretty similar anyway.

If you use quartzcompact's new merge facility (in CVS only currently),
then you can build many databases of (say) a few million documents in
parallel without much need for synchronisation - just partition the job
and wait for them all to finish.  Then you merge the built databases
together - either all at once, or N at a time in parallel until you have
just one.  I've not benchmarked quartzcompact merging yet, but I merged
about 43 databases with just under 500,000 documents each for gmane in
one pass, and it coped pretty well.
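
As a rough sketch of the "N at a time until you have just one" idea, the
Python below only plans which databases go into which merge round.  It
does not run quartzcompact, and the function name, the fan_in parameter
and the "merged-rX-Y" names are all made up for illustration.

```python
def plan_merge_rounds(databases, fan_in):
    """Return a list of rounds; each round is a list of groups to merge.

    Groups within one round are independent, so they could be merged in
    parallel.  A group of one needs no merge and is simply carried over.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    rounds = []
    current = list(databases)
    round_no = 0
    while len(current) > 1:
        groups = [current[i:i + fan_in]
                  for i in range(0, len(current), fan_in)]
        rounds.append(groups)
        # Each multi-database group becomes one merged database with a
        # hypothetical name; singleton groups keep their original name.
        current = [
            group[0] if len(group) == 1
            else "merged-r%d-%d" % (round_no, n)
            for n, group in enumerate(groups)
        ]
        round_no += 1
    return rounds
```

With 43 input databases, a fan_in of 43 gives a single round (the "all at
once" case), while a fan_in of 4 gives three rounds of 11, 3 and 1 groups.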

> What is(are) the classical design ?

N-way merging to produce the inverted file is textbook stuff.  It's
really just the old "sorting and merging from external store" approach,
which is never really obsoleted by faster computers with larger memory -
the dataset size where you start to need it just rises too.
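
To make the textbook approach concrete, here is a minimal Python sketch of
an N-way merge of sorted runs into an inverted file.  It assumes each run
is an already sorted sequence of (term, docid) pairs; in a real external
sort the runs would be streamed from files rather than held in lists.
This illustrates the general technique only, not how quartz stores or
merges its postings.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_postings(*runs):
    """Merge sorted runs of (term, docid) pairs.

    Yields (term, [docid, ...]) with terms in sorted order.  heapq.merge
    only looks at the head of each run, so memory use stays small even
    when the runs themselves are large.
    """
    merged = heapq.merge(*runs)
    for term, group in groupby(merged, key=itemgetter(0)):
        yield term, [docid for _, docid in group]
```

For example, merging [("apple", 1), ("pear", 3)] with [("apple", 2),
("fig", 2)] yields one entry per term, with the docids for "apple"
combined into a single list.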

As a general point, you want to try to design such that the indexing
processes don't need to communicate (ideally at all, though one way
async communication is pretty harmless - e.g. a web crawling process
generating URLs from links and spooling them to a file).
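
One way to get indexers that never talk to each other is to let each one
work out its own share of the job from a stable hash of the document key.
The sketch below assumes documents are identified by a string such as a
URL; the owner() and my_share() names are hypothetical.  hashlib is used
rather than Python's built-in hash() because the latter can differ
between processes.

```python
import hashlib


def owner(doc_key, num_workers):
    """Map a document key to a worker number in range(num_workers)."""
    digest = hashlib.sha1(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def my_share(doc_keys, worker_id, num_workers):
    """Return the keys this worker should index, in their original order."""
    return [key for key in doc_keys
            if owner(key, num_workers) == worker_id]
```

Every worker can read the same spooled list of keys and call my_share()
with its own worker_id; the shares are disjoint and together cover the
whole list, so no coordination is needed while indexing.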

> Based on what the separation of the quartz databases is made ?

If you're searching over several unmerged databases, try to make them
all a representative sample of the whole corpus, since Xapian by default
approximates term frequencies by looking at those in just one database
(the first I think, but check to be sure!)  It does this for efficiency.
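
A toy illustration of why the split matters, in Python.  This is not
Xapian's weighting code; it just compares the fraction of documents
containing a term in the first database against the fraction in the whole
corpus, for two made-up ways of splitting.  The corpus and all function
names are invented for the example.

```python
def term_fraction(docs, term):
    """Fraction of documents in docs that contain term."""
    return sum(1 for doc in docs if term in doc) / len(docs)


def split_contiguous(docs, n):
    """Split docs into n consecutive chunks (e.g. one topic per chunk)."""
    size = -(-len(docs) // n)  # ceiling division
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def split_interleaved(docs, n):
    """Deal docs out to n databases in turn, like dealing cards."""
    return [docs[i::n] for i in range(n)]


# Eight documents: the first four about one topic, the rest about another.
corpus = [{"linux", "kernel"}] * 4 + [{"cooking", "pasta"}] * 4
```

Here "linux" occurs in half of the corpus.  The first database of the
contiguous split holds only the "linux" documents, so an estimate taken
from it alone is badly off, while the first database of the interleaved
split has the same fraction as the corpus as a whole.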

> How is the updating handled to provide continuous services ?

You can search a database which is being updated, but if updates are
being flushed at a frequency such that a search may span more than
one flush, searches may be forced to restart (something the quartz
overhaul should fix).
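
From the application's side, coping with a forced restart can be sketched
as a small retry loop.  The Python below is generic and does not use the
Xapian bindings: ModifiedError is a stand-in for whatever error your
search layer raises when the database changes underneath it, and
run_search and reopen are callables you would supply yourself.

```python
class ModifiedError(Exception):
    """Stand-in for a 'database changed during the search' error."""


def search_with_restart(run_search, reopen, max_attempts=3):
    """Run run_search(), reopening and retrying if the database changed.

    The final failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(max_attempts):
        try:
            return run_search()
        except ModifiedError:
            if attempt == max_attempts - 1:
                raise
            # Pick up the latest flushed revision before trying again.
            reopen()
```

Capping the number of attempts matters if updates are flushed very
frequently, since otherwise a search could keep restarting indefinitely.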

If you aren't so concerned with new content being searchable right
away, it's simpler to build the database, run it through quartzcompact
and then add it to those searched.

> How is the cluster organized ? How are you dealing with
> unresponsive systems part of the cluster ?

Xapian::ErrorHandler allows some control of this.  Webtop used it
and seemed reasonably happy with it I think, but there may be a better
approach.  If a system goes completely unresponsive, you probably don't
want to keep waiting for timeouts from it...
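
One possible alternative, sketched in generic Python rather than with
Xapian::ErrorHandler: send the query to every node in parallel, wait for
a fixed time, and carry on with whichever nodes have answered.  The
query_shards() name and the idea of each node being a plain callable are
assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def query_shards(shards, query, timeout):
    """Query every shard in parallel, giving up on slow or failing ones.

    shards maps a shard name to a callable taking the query and
    returning a list of results.  Returns (results, skipped), where
    results maps shard name to its result list and skipped is a sorted
    list of shards that failed or did not answer within timeout seconds.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = {pool.submit(fn, query): name
               for name, fn in shards.items()}
    done, not_done = wait(futures, timeout=timeout)
    results, skipped = {}, []
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception:
            skipped.append(name)
    skipped.extend(futures[future] for future in not_done)
    # Don't block on stragglers; their threads finish in the background.
    pool.shutdown(wait=False)
    return results, sorted(skipped)
```

A shard that keeps turning up in the skipped list could then be dropped
from the set being searched for a while, rather than paying the timeout
on every query.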

> Is there any other free
> software components available for job allocation (updating, compacting
> and alike)  inside a quartz/xapian cluster ?

I suspect people just roll their own with Python or perl or similar.
It would be good to include some sample scripts at least if anyone
has some.

> Is there any technical
> comparison between Nutch/Lucene and Xapian/Quartz regarding large
> scale index ?

Not that I know of.

Divmod switched from Lucene to Xapian, and the only negative comment was
that Xapian databases are larger.  *If* the working set is also larger
(it's not at all obvious whether it would be), that means we'll scale
less well once everything is I/O bound.

But the quartz overhaul should reduce the database sizes quite
substantially as well as reducing the working set size.

> Thanks a lot for any feedback,

No problem.

Cheers,
    Olly