[Xapian-discuss] Xapian-discuss Digest, Vol 83, Issue 1

Justin Finkelstein justin at redwiredesign.com
Fri Apr 1 12:18:01 BST 2011


I think this is a shining example of how well Xapian works with large
document collections. I was just discussing this with my colleagues here
and one of the issues that came up is that we'd love Xapian to become
a lot more popular but have found that the documentation's a bit
difficult to get into, as is the API.

So I was wondering: do you have any thoughts on improving this and would
you like some help? I use Xapian a fair bit (mostly on
www.reportbuyer.com) together with a new wrapper for our CMS and have a
bit of spare time. I'd be happy to write up examples of how to use some
of the bindings, particularly PHP as that's my area.


> Message: 1
> Date: Thu, 31 Mar 2011 11:55:32 -0700
> From: Kevin Duraj <kevinduraj at gmail.com>
> Subject: [Xapian-discuss] Xapian Index: 607GB = 219 million of unique
> 	documents
> To: xapian-discuss at lists.xapian.org
> Message-ID:
> 	<AANLkTiku6tA06=s9hmX7nTcBHWSDfxdDgnHJuLUKhRBN at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> It took approximately five days, with a single process using one CPU
> core and 6GB of memory, to build this giant 607GB single Xapian index
> containing 219 million unique documents (web sites).  So far I have
> not found any other implementation that would enable me to build such
> a single index containing over 200 million documents, while testing
> Lucene, Solr, MySQL, Hadoop and Oracle.  Probably that would be the
> real reason why Xapian was not approved last year for Google's Summer
> of Code. Xapian is the type of open source that they don't want you to
> know about.
> 
> The index can be searched at: http://myhealthcare.com/
> 
> total 607G
> -rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
> -rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
> -rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
> -rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
> -rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
> -rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
> -rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
> -rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
> -rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB
> 
> $ delve .
> number of documents = 219344757
> average document length = 28255.9
> document length lower bound = 1
> document length upper bound = 173153
> highest document id ever used = 219344757
> 
> Cheers,
> Kevin Duraj
> http://myhealthcare.com
