[Xapian-discuss] Xapian Index: 607GB = 219 million of unique documents

Kevin Duraj kevinduraj at gmail.com
Thu Mar 31 19:55:32 BST 2011


It took approximately five days, with a single process using one CPU
core and 6GB of memory, to build this giant 607GB single Xapian index
containing 219 million unique documents (web sites).  So far I have
not found any other implementation that would let me build such a
single index containing over 200 million documents, having tested
Lucene, Solr, MySQL, Hadoop and Oracle.  That is probably the real
reason why Xapian was not approved last year for Google's Summer of
Code. Xapian is the type of open source that they don't want you to
know about.
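
For a rough sense of scale, the figures above can be turned into
rates.  This is only a back-of-the-envelope sketch: it treats
"approximately five days" as exactly five, rounds the document count
to 219 million, and assumes decimal units (1 GB = 1e6 KB).  The
function name build_rates is made up for illustration.

```python
DOCS = 219_000_000   # "219 million" unique documents, rounded
INDEX_GB = 607       # index size in GB as stated in the post
BUILD_DAYS = 5       # "approximately five days", treated as exact here


def build_rates(docs, index_gb, days):
    """Return rough indexing rates derived from the headline figures."""
    seconds = days * 24 * 60 * 60
    return {
        "docs_per_second": docs / seconds,
        "gb_per_day": index_gb / days,
        # Assumption: decimal units, so 1 GB = 1e6 KB.
        "kb_per_doc": index_gb * 1e6 / docs,
    }


rates = build_rates(DOCS, INDEX_GB, BUILD_DAYS)
for name, value in rates.items():
    print(f"{name} = {value:.1f}")
```

Under those assumptions this works out to roughly 500 documents
indexed per second on the single core.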

The index can be searched at: http://myhealthcare.com/

total 607G
-rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
-rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
-rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
-rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
-rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
-rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
-rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
-rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
-rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
-rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
-rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB
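
To see where the space goes, the four .DB sizes in the listing can be
compared directly.  A small sketch, using the sizes exactly as printed
(they are already rounded by ls, so the sum of 608 is within rounding
of the reported total of 607G); the name table_shares is made up for
illustration.

```python
# Sizes of the four chert tables as printed in the listing (in G, rounded).
TABLE_SIZES_G = {
    "position.DB": 311,
    "postlist.DB": 70,
    "record.DB": 131,
    "termlist.DB": 96,
}


def table_shares(sizes):
    """Return each table's fraction of the summed table sizes."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


shares = table_shares(TABLE_SIZES_G)
# Print the tables from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {TABLE_SIZES_G[name]:4d}G  {share:6.1%}")
```

The position table alone accounts for about half of the index.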

$ delve .
number of documents = 219344757
average document length = 28255.9
document length lower bound = 1
document length upper bound = 173153
highest document id ever used = 219344757
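
The same figures can also be read programmatically.  The following is
a minimal sketch assuming the Xapian Python bindings are installed and
the database lives in the current directory; the helper names
database_stats and report are made up for illustration and are not
part of Xapian.

```python
def database_stats(db):
    """Collect the figures shown in the delve output from a database object.

    `db` is expected to be a xapian.Database (or anything exposing the
    same accessor methods).
    """
    return {
        "number of documents": db.get_doccount(),
        "average document length": db.get_avlength(),
        "document length lower bound": db.get_doclength_lower_bound(),
        "document length upper bound": db.get_doclength_upper_bound(),
        "highest document id ever used": db.get_lastdocid(),
    }


def report(path="."):
    """Open the database at `path` read-only and print its statistics."""
    import xapian  # imported lazily; requires the Xapian Python bindings

    db = xapian.Database(path)
    for label, value in database_stats(db).items():
        print(f"{label} = {value}")
```

Calling report(".") from the index directory should print lines in the
same "label = value" form as above.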

Cheers,
Kevin Duraj
http://myhealthcare.com


