[Xapian-discuss] Xapian docs (was Re: Xapian-discuss Digest, Vol 83, Issue 2)

Andrew Betts andrew.betts at assanka.net
Sat Apr 2 12:47:21 BST 2011


> I think this is a shining example of how well Xapian works with large
> document collections. I was just discussing this with my colleagues here
> and one of the issues that came up is that we'd love Xapian to become
> really a lot more popular but have found that the documentation's a bit
> difficult to get into, as is the API. 

I agree.  There are a few gotchas, as well as branch features like matchspy that are phenomenally useful but largely undocumented and therefore underused (though by the looks of it matchspy is now in core).  I actually find the API docs to be a comprehensive reference, which is a great start.  I've recently been trying to use various RabbitMQ wrappers for PHP, and it's incredibly frustrating not being able to look up the syntax for something even when you know what you want.  Xapian isn't like that: if I know what I'm looking for, I can find it easily, and the docs are comprehensive on the subject.  What's missing is a well-organised resource on how to implement Xapian at a more strategic level, and how to achieve various common use cases well in each of the supported languages.

> 
> So I was wondering: do you have any thoughts on improving this and would
> you like some help? I use Xapian a fair bit (mostly on
> www.reportbuyer.com) together with a new wrapper for our CMS and have a
> bit of spare time. I'd be happy to write up examples of how to use some
> of the bindings, particularly PHP as that's my area.

I'd also be happy to contribute.  A cookbook-style format could be worth considering, like http://diveintogreasemonkey.org/patterns/index.html (though note that they haven't kept this up to date).  To a degree Xapian suffers from the same problem as RabbitMQ with its high-level docs: there's just a list of independently authored, inconsistently formatted articles, many of which cover the same ground.  See http://trac.xapian.org/wiki/Articles and http://www.rabbitmq.com/devtools.html.  Since Xapian has its own bindings for lots of languages, it should be relatively straightforward to provide consistent, high-level documentation that includes examples in multiple languages.

Anyway, happy to support this kind of project.  Can only be a good thing to get more people introduced to Xapian.

Andrew

> 
> 
>> Message: 1
>> Date: Thu, 31 Mar 2011 11:55:32 -0700
>> From: Kevin Duraj <kevinduraj at gmail.com>
>> Subject: [Xapian-discuss] Xapian Index: 607GB = 219 million of unique
>> 	documents
>> To: xapian-discuss at lists.xapian.org
>> Message-ID:
>> 	<AANLkTiku6tA06=s9hmX7nTcBHWSDfxdDgnHJuLUKhRBN at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>> 
>> It took approximately five days, with a single process using one CPU
>> core and 6GB of memory, to build this giant 607GB single Xapian index
>> containing 219 million unique documents (web sites).  So far I have
>> not found any other implementation that would enable me to build such
>> a single index containing over 200 million documents, while testing
>> Lucene, Solr, MySQL, Hadoop and Oracle.  Probably that would be the
>> real reason why Xapian was not approved last year for Google's Summer
>> of Code.  Xapian is the type of open source that they don't want you to
>> know about.
>> 
>> The following index can be searched at: http://myhealthcare.com/
>> 
>> total 607G
>> -rw-r--r-- 1 kevin kevin   28 2011-03-31 06:09 iamchert
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:50 position.baseA
>> -rw-r--r-- 1 kevin kevin 622K 2011-03-31 06:09 position.baseB
>> -rw-r--r-- 1 kevin kevin 311G 2011-03-31 06:09 position.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-30 17:19 postlist.baseA
>> -rw-r--r-- 1 kevin kevin 139K 2011-03-31 00:49 postlist.baseB
>> -rw-r--r-- 1 kevin kevin  70G 2011-03-31 00:49 postlist.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 00:49 record.baseA
>> -rw-r--r-- 1 kevin kevin 261K 2011-03-31 01:24 record.baseB
>> -rw-r--r-- 1 kevin kevin 131G 2011-03-31 01:24 record.DB
>> -rw-r--r-- 1 kevin kevin   14 2011-03-31 01:24 termlist.baseA
>> -rw-r--r-- 1 kevin kevin 192K 2011-03-31 01:50 termlist.baseB
>> -rw-r--r-- 1 kevin kevin  96G 2011-03-31 01:50 termlist.DB
>> 
>> $ delve .
>> number of documents = 219344757
>> average document length = 28255.9
>> document length lower bound = 1
>> document length upper bound = 173153
>> highest document id ever used = 219344757
>> 
>> Cheers,
>> Kevin Duraj
>> http://myhealthcare.com
>> 
>> 
>> 
>> ------------------------------
>> 
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>> 
>> 
>> End of Xapian-discuss Digest, Vol 83, Issue 1
>> *********************************************


