[Xapian-discuss] Xapian vs Lucene

Reini Urban rurban at x-ray.at
Thu Feb 1 07:10:48 GMT 2007


Yannick Warnier wrote:
> It's probably quite troll-risky to put a title like this, but did anyone
> take the trouble to compare Lucene to Xapian and make a list of
> differences?

I compared Lucene C# against Xapian in a rather non-technical way.
Lucene C# can do UTF-8 and has much better MS Office and PDF parsing and
searching (it uses native Windows techniques), but it is rather awkward
to customize. The C# they used was very compiler dependent.
Lucene C# also has a file-change notification hook, which is cool.

We use both in our company: I'm doing the Xapian search engine, and
a colleague built the Lucene search, on Windows only.

I think I won in the long term because Xapian was easier for me to
customize. I developed and tested the Xapian piece on Cygwin, and then
moved the production engine to Linux, which was 10 times faster.


> As I told the list at the end of last year, I'm going to have to
> integrate an indexing/search engine in the coming weeks or months. It
> will be integrated to Dokeos, an open-source e-learning application in
> PHP, and at the moment we are using MnoGoSearch which is alright but the
> problem lies in the indexing engine that we cannot really provide with
> our application as only the Linux version is GPL and it runs as a C
> program that has to be run via cron. Also, the free/collaborative
> support and mailing-list activity are a bit too loose/slow.
> 
> So far, my understanding is that I can use Xapian PHP bindings to index
> "on the fly" when inserting new content in my e-learning application. It
> is also my understanding that Lucene is a piece of code in Java (which
> is wrong for me as long as it involves more languages than just PHP for
> the Dokeos administrators to deal with) that is quite popular and that
> does things alright.
> 
> One problem I know of (from a Perl programmer) about Lucene is that the
> Perl bindings do not actually handle unicode characters, and so the
> *universality* of Lucene is lost when using it via the Perl bindings.
> 
> Of course, Dokeos-wise, it is important to have UTF-8 handling as we
> plan to move to full-UTF-8 just before we start integrating the new
> indexing...*stuff*.
> 
> As far as I am aware of, my search application (as a finished/integrated
> product) should deal with:
> - indexing of webpages

Both are very good.

> - indexing of documents (all office documents)

Lucene C# is the best of all; the Java Lucene is not that good.

> - indexing/parsing of XML metadata

both good.

> - awareness of user permissions (a result should only display if the
> searching user is authorized to see it)

This was complicated to achieve with Xapian. Native Windows Lucene C# is
better here.
For now we (xapian-omega) stick with HTTP auth via a mod_ntlm backend,
which handles the Windows auth tokens automatically. No login is
required on MSIE, only on Firefox.

I still have to implement the LDAP backend within Omega or PHP, so that
users and groups can be checked against the ACLs.
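
To make the idea of that check concrete, here is a minimal sketch in
Python. It is not how Omega does it and not code we run; the Hit class,
its allowed_users/allowed_groups fields and the visible_hits function
are all hypothetical names, and the group list is hard-coded where a
real setup would ask LDAP.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One search result plus the ACL stored with it (hypothetical layout)."""
    url: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def visible_hits(hits, user, user_groups):
    """Keep only the hits the user may see, preserving rank order."""
    groups = set(user_groups)
    return [
        h for h in hits
        # Visible if the user is named directly or shares a group with the ACL.
        if user in h.allowed_users or groups & h.allowed_groups
    ]


# Made-up example data, in the order the search engine ranked it.
hits = [
    Hit("/docs/public.html", allowed_groups={"everyone"}),
    Hit("/docs/payroll.pdf", allowed_groups={"hr"}),
    Hit("/docs/notes.txt", allowed_users={"reini"}),
]

# Assumption: in practice the group list would come from LDAP.
print([h.url for h in visible_hits(hits, "reini", ["everyone", "dev"])])
```

Filtering after the search like this is simple, but it can leave a
result page short; it is only meant to show which comparison the
backend has to make.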

Or simply do it via suexec and map the user into Samba for the CGI call
only. But I still have to persuade our IT to let me use Samba.
Security-wise, Linux with a working suexec and Samba is better than
native Windows.

> So, my question is: which is the best for my case? Lucene or Xapian? Any
> benchmarks or comparisons available?
> 
> Of course, this is specialised advice and I should really post the same
> mail to the Lucene list, but I'm not subscribed there yet, so for now I
> will analyse the feedback I get from here only (which will obviously
> distort it just a little bit).
-- 
Reini Urban
http://phpwiki.org/  http://murbreak.at/
http://helsinki.at/  http://spacemovie.mur.at/


