[Xapian-discuss] Optimization and Load balancing with Xapian

Olly Betts olly at survex.com
Mon Feb 20 12:55:40 GMT 2006


On Mon, Feb 20, 2006 at 12:05:23PM +0200, David Levy wrote:
> > But sorting as currently designed does need to process every matching
> > document, which is going to be slow for a large database if the query
> > matches a lot of documents.
> 
> Will this mechanism change in future releases?

It's possible there's a better way to handle it.  If we came up with a
workable scheme and somebody implemented it then we'd have a different
mechanism.  So it might change, but it's not something I'm currently
working on or actively planning to.

The problem is that you really want to process the documents in sorted
order, as you can then just stop once you've filled the MSet.  You could
list the document ids in ranked order for each sortable value (it would
take a fair amount of space), but then all the posting lists
list documents in id order, so you can't easily process documents in
sorted order even though you would then know that order.  You could
try to visit the docids in that order by seeking into the posting
lists, random-access style.  That would work OK if the top N items all
made it into the MSet, but at some point it'll become less efficient...
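
To make the trade-off above concrete, here is a toy sketch in Python.  It
is not Xapian code: the function names, the dict standing in for document
values and the set standing in for seekable posting lists are all
assumptions made up for illustration.  It contrasts reading the sort key
of every match with walking a precomputed sorted docid list and stopping
once N matches have been found.

```python
import heapq

def top_n_full_scan(matches, sort_key, n):
    """Read the sort key of every matching document, then keep the best n."""
    reads = 0
    keyed = []
    for docid in matches:
        reads += 1  # one value lookup per matching document
        keyed.append((sort_key[docid], docid))
    best = heapq.nsmallest(n, keyed)
    return [docid for _, docid in best], reads

def top_n_sorted_walk(matches, docids_in_sort_order, n):
    """Visit docids in sort order and stop once n of them match.

    The set membership test is a stand-in (an assumption) for seeking
    into posting lists to check whether a docid matches the query.
    """
    match_set = set(matches)
    result = []
    probes = 0
    for docid in docids_in_sort_order:
        probes += 1  # one "seek" per docid visited
        if docid in match_set:
            result.append(docid)
            if len(result) == n:
                break  # the result set is full, so we can stop early
    return result, probes

# Hypothetical data: 1000 documents with distinct sort keys.
sort_key = {docid: (docid * 856) % 1009 for docid in range(1, 1001)}
docids_in_sort_order = sorted(sort_key, key=sort_key.get)
matches = [docid for docid in range(1, 1001) if docid % 3 == 0]
```

Both functions return the same documents.  The sorted walk does well when
the query matches many documents, and degrades when matches are sparse:
with a single matching document it probes the whole sorted list before
giving up on finding N.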

But it looks like this isn't currently the bottleneck.

> I have compacted and removed large fields in the index. So the database is
> half the size ... but performance is still slow.
> I am thinking about using "ramdisks" maybe; and I am checking my hard disks
> too.
> Have you used ramdisks with Xapian yet? Does it help?

The VM system in a modern Unix-like OS will cache blocks recently read
from disk.  This dynamic caching is probably going to do as well as
trying to force parts of the database into RAM.  By all means give it
a try, but I doubt it's a magic bullet.

> > But even now, "sort by date" is still acceptably fast on 30 million
> > documents, which points the finger strongly towards accessing the values
> > as taking most of the time.
> 
> How do you mean?
> I had bad results with < 1M documents:

I mean "sort by date" is acceptably fast *on gmane*, which doesn't use
sorting on values, but still has to trawl through the whole of each
posting list in this case.  That strongly suggests that the bottleneck
is currently with getting at the values to do the sorting.

> However, I used the "collapse" parameter. Is it time consuming even if
> there are no records to collapse in the results?

Collapsing still needs to read the values, even if they are unique.  So
if collapsing is also slow, that further points the finger at the
storage of the values.
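
As a rough illustration of why unique keys don't make collapsing free,
here is a minimal Python sketch.  It is not how Xapian's matcher is
implemented: `collapse_pass` and the dict of collapse keys are
hypothetical names used only for this example.  The point is that the key
has to be read for every candidate before you can know whether it
duplicates one already seen.

```python
def collapse_pass(ranked_docids, collapse_key):
    """Keep only the first document seen for each collapse key.

    Returns the surviving docids and the number of value lookups made.
    """
    seen = set()
    kept = []
    value_reads = 0
    for docid in ranked_docids:
        key = collapse_key[docid]  # the value must be read to compare it
        value_reads += 1
        if key in seen:
            continue  # duplicate key: collapse this document away
        seen.add(key)
        kept.append(docid)
    return kept, value_reads
```

Whether or not any documents actually get collapsed, the number of value
lookups equals the number of candidates considered.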

Cheers,
    Olly


