[Xapian-discuss] Open source search engines compared

Sat Jul 11 08:43:51 BST 2009

On 10-7-2009 21:36 Kevin Duraj wrote:
> This article is misleading because the user is testing a very small
> amount of data (90MB - 300MB). Most likely the article and test were
> done by someone who just recently learned about search engines or wants
> to make his blog popular. The average server now has 16GB of memory,
> therefore even mediocre search engines such as Lucene can show
> satisfactory results when less data is used than the available memory
> on the server.

That's just nonsense. Why should a database be large to be considered a
real database? We have a very large web forum and have a database of
"only" about 19GB (compacted) for it. Your other post to the list
suggests that one shouldn't even use Xapian for small databases, which
is an even sillier statement.
Apart from that, Lucene's database will apparently fit in memory much
longer than Xapian's. Our 9GB text corpus gets inflated to 25GB in
Xapian prior to compaction (18GB after). If Lucene produced a database
of only 10GB with similar retrieval performance (in terms of quality,
not time), it would be much easier to oversize the required server, and
you could still easily outperform Xapian (in terms of time), given your
statement that Lucene only works well if the database fits in memory.
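
To put rough numbers on that, here is a minimal Python sketch of the
arithmetic. The 10GB Lucene figure is the hypothetical from the paragraph
above, not a measurement, and the 16GB of RAM is the "average server"
from Kevin's post. The function names are made up for illustration, and
the check ignores memory needed by the OS and the application.

```python
# Sizes in GB as quoted in this thread.
CORPUS_GB = 9.0

index_sizes_gb = {
    "Xapian, before compaction": 25.0,
    "Xapian, after compaction": 18.0,
    "Lucene (hypothetical, not measured)": 10.0,
}


def expansion_ratio(index_gb, corpus_gb=CORPUS_GB):
    """Return the index size as a multiple of the raw text size."""
    return index_gb / corpus_gb


def fits_in_memory(index_gb, ram_gb=16.0):
    """True if the whole index is no larger than the RAM (simplified)."""
    return index_gb <= ram_gb


for name, size in index_sizes_gb.items():
    print(f"{name}: {expansion_ratio(size):.2f}x the corpus, "
          f"fits in 16GB RAM: {fits_in_memory(size)}")
```

Under these assumptions only the hypothetical 10GB index fits in memory
on that server; both Xapian sizes do not.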

> If the user would use at least 100GB of data, Lucene and many other
> open source engines would be dead, where Xapian rocks beyond 100GB of
> data with no problems. The year now is 2009 and we are talking in
> Terabytes and Gigabytes, not Megabytes. Who are these people writing
> these articles, confusing and misleading people? If you are dealing
> with Megabytes of data MySQL is fine, you do not need to use a search
> engine.

That's just not true. I've actually been working lately on an in-memory
Lucene index of just 26MB to support searching through 130k product
names (35MB of data). Currently we use a Xapian index for that. It
indexes less data, but that data is indexed differently (stemmed and
non-stemmed versions plus n-grams), so it's not really a fair
comparison. Still, that database is now 530MB (although compaction
reduces that to 217MB).
But a MySQL-based LIKE search (or a full-text index, which we can't use
on the table itself because we don't want MyISAM) would be nowhere near
as powerful as a Xapian or Lucene search.
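
As an illustration of why indexing n-grams next to the plain terms makes
an index grow, here is a small Python sketch. It is not our actual
indexing code: the n-gram length of 3, the function names and the sample
product name are assumptions for the example, and stemming is left out.

```python
def char_ngrams(term, n=3):
    """Return the character n-grams of a term, lowercased.

    Terms no longer than n characters are returned whole.
    """
    term = term.lower()
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def index_terms(name, n=3):
    """Return the set of terms to index for one product name:
    each lowercased word plus its character n-grams."""
    terms = set()
    for word in name.split():
        word = word.lower()
        terms.add(word)
        terms.update(char_ngrams(word, n))
    return terms


# Hypothetical product name, just to show the term count.
name = "Wireless Keyboard"
print(len(name.split()), "words become", len(index_terms(name)), "terms")
```

Every word adds several extra terms to the index, which is the price for
being able to match on parts of a product name.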

By the way, it is really hard to find a free large corpus of text that
can be used as a relatively fair benchmark between search environments.
You'd need something like the corpora TREC uses, so you can compare the
relevance scores with the given expert scores.
The larger corpora like these aren't very cheap to come by:
http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
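
To show what comparing against expert scores could look like, here is a
minimal Python sketch of precision at k, one common retrieval-quality
measure. I'm not claiming this is the measure the article or TREC would
use for such a comparison; the document ids and judgments below are made
up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that the experts judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


# Hypothetical expert judgments for one query.
relevant = {"d1", "d2"}

# Hypothetical rankings returned by two engines for that query.
engine_a = ["d1", "d2", "d7", "d9"]
engine_b = ["d3", "d1", "d7", "d2"]

for name, ranking in (("engine A", engine_a), ("engine B", engine_b)):
    print(name, "precision at 2:", precision_at_k(ranking, relevant, 2))
```

With judgments like these you can compare engines on quality separately
from search time, which is exactly what a corpus without judgments
doesn't allow.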

If you want to go really large, you could obtain this 5TB (compressed)
dataset:
http://boston.lti.cs.cmu.edu/Data/clueweb09/
But then you need to be able to store the database resulting from that
dataset...

But even the smaller multi-gigabyte corpora require in-depth
understanding of the various search engines to be able to do a fair
comparison, and several hours to actually index the data (our new server
took 4 hours to index 9GB of data in Xapian). So I can understand that
someone chooses to do the comparison with much smaller datasets; most
people don't have multi-gigabyte datasets anyway.
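
For a sense of scale, a back-of-the-envelope Python sketch based on our
own figure (9GB in 4 hours). The linear extrapolation is an assumption:
indexing speed usually doesn't stay constant as an index grows, and other
hardware will give other numbers. The function names are just for this
example.

```python
CORPUS_GB = 9.0    # size of our text corpus
INDEX_HOURS = 4.0  # time our new server needed to index it in Xapian


def throughput_mb_per_s(corpus_gb, hours):
    """Average indexing throughput in MB per second."""
    return corpus_gb * 1024 / (hours * 3600)


def estimated_hours(target_gb, corpus_gb=CORPUS_GB, hours=INDEX_HOURS):
    """Naive linear estimate of the indexing time for a larger corpus."""
    return target_gb / corpus_gb * hours


print(f"{throughput_mb_per_s(CORPUS_GB, INDEX_HOURS):.2f} MB/s on average")
print(f"about {estimated_hours(100):.0f} hours for 100GB at the same rate")
```

That works out to roughly 0.64 MB/s, so a single benchmark run on 100GB
would take nearly two days on our hardware, before tuning anything.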

> Here is proof how fast search goes on 500GB of data using Xapian, can
> Lucene do that on single server? ... of course not.
> http://myhealthcare.com

Have you actually tried? I haven't, so I don't know the answer to that.
Although I'm probably going to try Lucene or Solr for our 9GB corpus
sometime soon, mainly because we'd like some adjustments to the current
searching in our forum that aren't yet supported in Xapian and, with our
lack of C++ experience, are quite hard to implement well.

Btw, these guys actually seem to have indexed the GOV2 corpus (400+GB)
with Lucene:
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
So it is at least possible; whether the search time was any good is a
bit harder to decipher.

Best regards,

Arjen

> PS: When blind leading the blind, at once they all fall off the cliff.
>  This is Information Technology not Banking Industry, we know the
> mathematics.
> 
> Thanks,
> Kevin Duraj
> http://myhealthcare.com
> 
> 
> On Mon, Jul 6, 2009 at 7:19 AM, Charlie Hull<charlie at juggler.net> wrote:
>> Hi all,
>>
>> You may find
>> http://developers.slashdot.org/story/09/07/06/131243/Open-Source-Search-Engine-Benchmarks
>> interesting.
>>
>> Xapian was rather slated for large index sizes and slow indexing, but
>> had comparable search performance to Lucene.
>>
>> Cheers
>>
>> Charlie
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>