[Xapian-discuss] Indexing speed benchmark - Xapian, Solr
Olly Betts
olly at survex.com
Sat Apr 18 04:42:11 BST 2009
On Sun, Apr 12, 2009 at 07:26:03AM -0700, Andy wrote:
> I came across this benchmark between Xapian & Solr:
>
> http://www.anur.ag/blog/2009/03/xapian-and-solr/
Note that this is mis-titled - it is really a benchmark between Xappy
and Solr, though I don't know how much difference that makes. Richard
can probably comment more usefully.
It's good that it actually says what versions were used (many benchmarks
fail to), but it's a shame that the benchmark code itself isn't
available - an experiment you can't independently reproduce isn't really
scientifically valid.
> According to the benchmark, a doc set that took Solr 34 min to index
> took Xapian 7 hours. Solr's index is also much smaller - 2.5GB to
> Xapian's 8.9GB.
That's not a fair size comparison. 2.5GB was the "optimized index size"
for Solr. The comparable figure for Xapian is the compacted size which
was 6.5GB.
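
To put rough numbers on that, here is a small sketch that does nothing more than divide the sizes quoted above. The variable names are made up for illustration; nothing here was measured independently.

```python
# Index sizes in GB, as quoted from the benchmark write-up.
solr_optimized_gb = 2.5
xapian_raw_gb = 8.9
xapian_compacted_gb = 6.5

# Raw Xapian against optimized Solr overstates the gap.
unfair_ratio = xapian_raw_gb / solr_optimized_gb
# Compacted Xapian against optimized Solr is the like-for-like comparison.
fair_ratio = xapian_compacted_gb / solr_optimized_gb

print(f"raw vs optimized:       {unfair_ratio:.2f}x")
print(f"compacted vs optimized: {fair_ratio:.2f}x")
```

So the like-for-like gap is about 2.6x rather than about 3.6x, which is still a real difference.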
> I'm new to Xapian. Just wondering if results like these are typical?
> Is indexing speed & size a known issue in Xapian? Or is there some
> other explanation for the big difference between the Solr & Xapian
> results?
Regarding the indexing time, by default Xapian auto-commits every 10000
documents, which is pretty conservative on modern hardware. The article
doesn't mention tuning this (by setting XAPIAN_FLUSH_THRESHOLD) so I
assume he didn't. If you have plenty of RAM, increasing that will speed
up indexing a lot. I'd imagine on the hardware described you could
index all million documents in one go, especially since they are
truncated to 2000 characters which is really short. And if you index in
one go, the database shouldn't need compacting either.
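
A minimal sketch of raising the threshold for an indexing run, assuming the indexer is launched as a separate process. XAPIAN_FLUSH_THRESHOLD is the real environment variable; the helper names and the idea of wrapping the indexer in a subprocess are hypothetical, and whether a threshold of 1000000 fits in RAM on a given machine is something to check rather than assume.

```python
import os
import subprocess


def env_with_flush_threshold(threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    base_env defaults to the current process environment; the input
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(int(threshold))
    return env


def run_indexer(command, threshold):
    """Run an indexing command (hypothetical) with the raised threshold."""
    return subprocess.run(
        command, env=env_with_flush_threshold(threshold), check=True
    )


# Example: build the environment for indexing a million documents in one
# batch. The indexer command itself is left to the caller.
env = env_with_flush_threshold(1000000)
print(env["XAPIAN_FLUSH_THRESHOLD"])
```

If the indexer runs in the same process as this code, setting the variable in os.environ before opening the writable database should have the same effect, since as far as I know the value is read when the database is opened.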
Ideally the flush size should probably adjust itself, but nobody has
done any work on that so far.
But it's true that more effort has been put into search speed than
indexing speed so far, so there's likely to be a lot of potential for
making indexing faster.
Database size is something we have been working on a bit, and the new
chert backend which will debut shortly in the 1.1.x development series
will give smaller databases (especially the postlist table). It'll
still be larger than Solr in this case though.
If I understand correctly how Lucene handles document deletion, one
big difference between Lucene and Xapian is that Xapian stores the
list of terms indexed by each document, which allows it to perform
a "perfect" delete. Lucene doesn't store this information and can only
flag the document as deleted, which means that the stats won't get
updated to reflect this change right away:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-9475b5b51f7ca022e03dbd94cb82b4a6c02e3675
    Once a document is deleted it will not appear in TermDocs nor
    TermPositions enumerations, nor any search results. Attempts to load
    the document will result in an exception. The presence of this
    document may still be reflected in the docFreq statistics, and thus
    alter search scores, though this will be corrected eventually as
    segments containing deletions are merged.
While having stale stats is not ideal, the size of the termlist table is
quite a price to pay for "perfect" deletion if you don't need it for
other reasons, and in some situations you never need to delete documents
anyway. We're intending to allow the termlist table to be optional,
probably during the 1.1 development series:
http://trac.xapian.org/ticket/181
Once that's done, allowing "imperfect" deletion would be fairly easy.
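
The trade-off between the two kinds of deletion can be sketched with a toy in-memory index. This is purely illustrative: ToyIndex and its methods are hypothetical and resemble neither Xapian's nor Lucene's actual storage or API. It only shows why a per-document termlist lets the statistics be fixed immediately, while flagging leaves them stale.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index, with or without a per-document termlist."""

    def __init__(self, keep_termlists=True):
        self.postings = defaultdict(set)  # term -> set of docids
        self.doc_freq = defaultdict(int)  # term -> count of docs with term
        # docid -> set of terms; None means "no termlist table".
        self.termlists = {} if keep_termlists else None
        self.deleted = set()              # docids flagged as deleted

    def add(self, docid, text):
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(docid)
            self.doc_freq[term] += 1
        if self.termlists is not None:
            self.termlists[docid] = terms

    def delete(self, docid):
        if self.termlists is not None:
            # "Perfect" delete: the termlist says exactly which postings
            # and statistics to update.
            for term in self.termlists.pop(docid):
                self.postings[term].discard(docid)
                self.doc_freq[term] -= 1
        else:
            # "Imperfect" delete: just flag the document; postings and
            # doc_freq stay stale until some later cleanup.
            self.deleted.add(docid)

    def search(self, term):
        return sorted(self.postings.get(term, set()) - self.deleted)


for keep in (True, False):
    index = ToyIndex(keep_termlists=keep)
    index.add(1, "xapian index")
    index.add(2, "solr index")
    index.delete(2)
    print(keep, index.search("index"), index.doc_freq["index"])
```

Both variants stop returning document 2, but only the one with termlists reports a document frequency of 1 for "index" afterwards; the flagged variant still reports 2.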
To give an idea how much difference that would make, Gmane's index
(running on the new chert backend) is 130GB, of which the termlist table
is 62GB. Gmane doesn't currently index positional data - if it did I
guess the database would be roughly twice as large (around 260GB), but
removing the 62GB termlist table would still save about 25% of that.
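
The arithmetic behind that estimate, as a short sketch. It assumes the "roughly twice as large" guess is taken as exactly 2x, and that adding positional data would leave the termlist table itself at 62GB; both are assumptions for illustration, not measurements.

```python
total_gb = 130     # Gmane index size on chert, as quoted above
termlist_gb = 62   # size of the termlist table within that

# Saving from dropping the termlist table today (no positional data).
saving_now = termlist_gb / total_gb

# Assumed: positional data doubles the total and leaves the termlist
# table unchanged.
total_with_positions_gb = 2 * total_gb
saving_with_positions = termlist_gb / total_with_positions_gb

print(f"without positions: {saving_now:.0%}")
print(f"with positions:    {saving_with_positions:.0%}")
```

This prints 48% and 24%, so the saving is close to half for Gmane as it stands, and about a quarter in the hypothetical case with positional data.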
Cheers,
Olly