[Xapian-discuss] Open source search engines compared
Arjen van der Meijden
acmmailing at tweakers.net
Sat Jul 11 08:43:51 BST 2009
On 10-7-2009 21:36 Kevin Duraj wrote:
> This article is misleading because the user is testing a very small
> amount of data (90MB - 300MB). Most likely the article and test were
> done by someone who just recently learned about search engines or wants
> to make his blog popular. The average server now has 16GB of memory,
> therefore even mediocre search engines such as Lucene can show
> satisfactory results when less data is used than available memory on
> the server.
That's just nonsense. Why should a database have to be large to be
considered a real database? We have a very large web forum and have a
database of "only" about 19GB (compacted) for it. Your other post to the
list suggests that one shouldn't even use Xapian for small databases,
which is an even sillier statement.
Apart from that, Lucene's database will apparently fit in memory much
longer than Xapian's. Our 9GB text corpus gets inflated to 25GB prior to
compaction in Xapian (18GB after). If Lucene builds a database of only
10GB with similar retrieval performance (in quality, not time), it's
much easier to oversize the required server, and you can still easily
outperform Xapian (in time), given your statement that Lucene only works
well if the database fits in memory...
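
As a rough illustration of the sizes above, here is a small Python
sketch that computes how much each index inflates the 9GB corpus and
whether it would fit in the 16GB of memory from the quoted post. The
10GB Lucene figure is the hypothetical from the paragraph above, not a
measurement, and the function names are made up for this example.

```python
CORPUS_GB = 9          # size of the raw text corpus
SERVER_MEMORY_GB = 16  # the "average server" figure from the quoted post


def inflation_factor(index_gb, corpus_gb=CORPUS_GB):
    """Return how many times larger the index is than the raw text."""
    return index_gb / corpus_gb


def fits_in_memory(index_gb, memory_gb=SERVER_MEMORY_GB):
    """Return True if the whole index could be held in memory."""
    return index_gb <= memory_gb


index_sizes_gb = {
    "Xapian, before compaction": 25,
    "Xapian, after compaction": 18,
    "Lucene (hypothetical)": 10,
}

for name, size_gb in index_sizes_gb.items():
    factor = inflation_factor(size_gb)
    in_memory = "yes" if fits_in_memory(size_gb) else "no"
    print(f"{name}: {factor:.2f}x the corpus, fits in memory: {in_memory}")
```

This ignores the memory needed by the operating system and the search
process itself, so it is only a first approximation.
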
> If the user would use at least 100GB of data, Lucene and many other open
> source engines would be dead, whereas Xapian rocks beyond 100GB of data
> with no problems. The year now is 2009 and we are talking in Terabytes and
> Gigabytes, not Megabytes. Who are these people writing these articles,
> confusing and misleading people? If you are dealing with Megabytes of data
> MySQL is fine, you do not need to use a search engine.
That's just not true. I've actually been working lately on an in-memory
Lucene index of just 26MB to support searching through 130k product
names (35MB of data). Currently we use a Xapian index for that. It
indexes less data, but that data is indexed differently (stemmed and
non-stemmed versions plus n-grams), so it's not really a fair
comparison. Still, that database is now 530MB (although compaction
reduces that to 217MB).
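
To give an idea of why indexing n-grams next to the plain words makes an
index grow, here is a minimal Python sketch. It is not the indexing code
described above: character n-grams with n=3 are an assumption, stemming
is left out, and both function names are made up for this example.

```python
def char_ngrams(word, n=3):
    """Return all character n-grams of a word (the word itself if shorter)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(name, n=3):
    """Return the set of terms to index for one product name.

    Each lower-cased word is indexed as-is, plus all of its n-grams.
    """
    terms = set()
    for word in name.lower().split():
        terms.add(word)
        terms.update(char_ngrams(word, n))
    return terms


name = "Xapian index"
terms = index_terms(name)
print(f"{len(name.split())} words become {len(terms)} index terms")
print(sorted(terms))
```

Every extra term needs its own posting list entry, so the index grows
well beyond the size of the original text.
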
But a MySQL-based LIKE search (or a full-text index, which we can't use
on the table itself because we don't want MyISAM) would be nowhere near
as powerful as a Xapian or Lucene search.
By the way, it is really hard to find a free large corpus of text that
can be used as a relatively fair benchmark between search environments.
You'd need something like the TREC corpus, so you can compare the
relevance scores with the given expert scores.
The larger corpora like these aren't very cheap to come by:
http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
If you want to go really large, you could obtain this
http://boston.lti.cs.cmu.edu/Data/clueweb09/ 5TB (compressed) dataset.
But then you need to be able to store the database resulting from that
dataset...
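
As an illustration of comparing search results with expert judgments,
here is a minimal Python sketch of precision at k, one common way to
score a ranked result list against a set of documents that experts
marked as relevant. The document ids, the two result lists and the
function name are all made up for this example. Real TREC evaluations
use more measures than this one.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of the top-k results that experts judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical expert judgments for one query.
relevant = {"d1", "d3", "d7"}

# Hypothetical ranked results from two search engines for that query.
engine_a = ["d1", "d2", "d3", "d4", "d5"]
engine_b = ["d2", "d4", "d1", "d5", "d6"]

for name, results in (("engine A", engine_a), ("engine B", engine_b)):
    print(f"{name}: P@5 = {precision_at_k(results, relevant, 5):.2f}")
```
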
But even the smaller multi-gigabyte corpora require an in-depth
understanding of the various search engines to be able to do a fair
comparison, and several hours to actually index the data (our new server
took 4 hours to index 9GB of data in Xapian). So I can understand that
someone chooses to do the comparison with much smaller datasets; most
people don't have multi-gigabyte datasets anyway.
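
To put the indexing time in perspective, here is a small Python sketch
that turns "9GB in 4 hours" into a throughput and naively extrapolates
it to the 100GB mentioned in the quoted post. The extrapolation assumes
indexing speed stays constant as the database grows, which is an
assumption and not something measured here. The function names are made
up for this example.

```python
def throughput_mb_per_s(data_gb, hours):
    """Return the average indexing throughput in MB/s (1GB = 1024MB)."""
    return data_gb * 1024 / (hours * 3600)


def hours_to_index(data_gb, mb_per_s):
    """Return the hours needed to index data_gb at a constant mb_per_s."""
    return data_gb * 1024 / mb_per_s / 3600


rate = throughput_mb_per_s(9, 4)
print(f"Indexing throughput: {rate:.2f} MB/s")

# Naive linear extrapolation; real indexing may slow down on larger data.
print(f"100GB at that rate: about {hours_to_index(100, rate):.1f} hours")
```
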
> Here is proof how fast search goes on 500GB of data using Xapian, can
> Lucene do that on single server? ... of course not.
> http://myhealthcare.com
Have you actually tried? I haven't, so I don't know the answer to that,
although I'm probably going to try Lucene or Solr for our 9GB corpus
sometime soon. That is mainly because we'd like some adjustments to the
current searching in our forum that aren't yet supported in Xapian and,
with our lack of C++ experience, are quite hard to implement well.
Btw, these guys actually seem to have indexed the GOV2 corpus (400+GB)
with Lucene:
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
So it is at least possible; whether the search time was any good is a
bit harder to decipher.
Best regards,
Arjen
> PS: When blind leading the blind, at once they all fall off the cliff.
> This is Information Technology not Banking Industry, we know the
> mathematics.
>
> Thanks,
> Kevin Duraj
> http://myhealthcare.com
>
>
> On Mon, Jul 6, 2009 at 7:19 AM, Charlie Hull<charlie at juggler.net> wrote:
>> Hi all,
>>
>> You may find
>> http://developers.slashdot.org/story/09/07/06/131243/Open-Source-Search-Engine-Benchmarks
>> interesting.
>>
>> Xapian was rather slated for large index sizes and slow indexing, but
>> had comparable search performance to Lucene.
>>
>> Cheers
>>
>> Charlie
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>