[Xapian-discuss] Future of Xapian (long)

Francis Irving francis@flourish.org
Sun, 20 Jun 2004 16:32:13 +0100


On Fri, Jun 18, 2004 at 04:47:54PM +0100, Charlie Hull wrote:
> We've been trying to compile a 'wishlist' for Xapian
> improvements/features and this is what we have so far (in no
> particular order): we'd like some idea of what people regard as a
> priority.
 
> b. A web server for Xapian.

Why?  That actually sounds like a bad idea to me.

> c. A summarizer/highlighter component; we've noticed that TheyWorkForYou.org
> have this already but we also have some code to do this.

Yes we do.  This is relatively straightforward with a simple search,
but much harder with stemming (which we don't do at the moment).

QueryParser may be one place to put this, although it would be good to
be able to do it with any query.  I would like two functions:

1. Takes a document, a query and a required excerpt length.  The
function returns a suggested place for the excerpt to begin and end
(without breaking words in half).  I talked to Olly about this in the
pub the other week.  It would scan for the window containing the most
relevant terms.  This means that if you have several words together
at the end of the document, that window would be returned, rather
than one containing a single word at the start.
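
To make point 1 concrete, here is a minimal Python sketch of the window
scan.  It is an illustration only, not Xapian code: the name
best_excerpt, the regex tokeniser and the exact word matching (no
stemming) are all assumptions made for the example.

```python
import re


def best_excerpt(text, query_terms, max_len):
    """Return (start, end) character offsets of the window of at most
    max_len characters holding the most query-term occurrences.
    The window always starts and ends on a word boundary."""
    terms = {t.lower() for t in query_terms}
    # Each word as (start offset, end offset, is it a query term?).
    words = [(m.start(), m.end(), m.group().lower() in terms)
             for m in re.finditer(r"\w+", text)]
    best = (0, 0)
    best_hits = -1
    hits = 0
    j = 0  # index one past the last word in the current window
    for i in range(len(words)):
        if j < i:
            # The previous word did not fit on its own; start afresh.
            j, hits = i, 0
        # Grow the window to the right while it still fits.
        while j < len(words) and words[j][1] - words[i][0] <= max_len:
            hits += words[j][2]
            j += 1
        if j > i:
            if hits > best_hits:  # earliest window wins on ties
                best_hits = hits
                best = (words[i][0], words[j - 1][1])
            # Word i leaves the window before the next iteration.
            hits -= words[i][2]
    return best


text = "dog at the start, then filler words, and finally big red dog"
start, end = best_excerpt(text, ["big", "red", "dog"], 12)
print(text[start:end])
```

With the sample text, the cluster of three terms at the end is chosen
over the single "dog" at the start, which is the behaviour described
above.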

2. Takes a document, a query and a highlight prefix/suffix.  Returns
the document with the matching terms highlighted.  A bonus feature
would be different colours for different search terms (like the
Google cache does).
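
A sketch of function 2, again in Python and again only an
illustration: the name highlight and its parameters are hypothetical,
and it matches exact words only, so it covers the simple-search case
and not the harder stemming one.

```python
import re


def highlight(text, query_terms, prefix="<b>", suffix="</b>"):
    """Wrap every whole-word, case-insensitive match of a query term
    in prefix/suffix and return the resulting text."""
    terms = {t.lower() for t in query_terms}

    def wrap(match):
        word = match.group()
        # Leave non-matching words exactly as they were.
        return prefix + word + suffix if word.lower() in terms else word

    return re.sub(r"\w+", wrap, text)


print(highlight("The cat sat on the Cat mat", ["cat"], "[", "]"))
```

Per-term colours could be layered on by passing a mapping from term to
its own prefix/suffix pair instead of a single pair.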

> d. A spellchecker (like Google's 'did you mean xxx') using edit distance
> calculation.

That would be nice to have.  How hard is that to do?  Would it be
fast?  What extra information do you index to do it?
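
For reference, the edit distance part on its own is small.  The sketch
below is plain Levenshtein distance in Python plus a brute-force
lookup; the names edit_distance and suggest, the max_distance cutoff
and the idea of scanning a whole term list are assumptions for
illustration, not how Xapian does or would do it.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance, or None."""
    best, best_dist = None, max_distance + 1
    for candidate in vocabulary:
        d = edit_distance(word, candidate)
        if d < best_dist:
            best, best_dist = candidate, d
    return best


print(suggest("xapain", ["xapian", "omega", "query"]))
```

Scanning every indexed term like this costs one distance calculation
per term, so the speed question really comes down to what extra data
is kept to narrow the candidate list first.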

> e. A web spider

Why?  The main benefit of Xapian is that it doesn't spider, but can be
used to search structured data in a database.  I haven't looked at
Omega, but a separate project like that seems more the place to put a
spider.  Not as part of Xapian itself.

> f. An easy(ier) way of plugging in the various open source file format
> converters, for indexing Ms Office and other formats, with a list of which
> ones actually work!

Again I see this as being higher level than the core Xapian
information retrieval API.

> g. More example programs, setup HOWTOs etc. to make the initial learning
> curve a bit less steep.

Yes, yes, yes.  And better online documentation, probably with
comments or some form of easy feedback so you can update it.

The basic example is fine, but doing fancier stuff (working out how
to replace documents, date sorting etc.) is hard.  More examples in
xapian-examples, between them using every feature, would be good.

Documentation for the Perl/PHP/Python etc. bindings.  Ideally all of
xapian-examples ported to each one.

> h. A connector to ASP; some way of easily integrating Xapian results into
> ASP pages. We've done something similar in the past for another search
> engine.
> i. Native compilation under Windows.

I would add this one here, which is very important:

j. Complete bindings.  Fix all the various naming problems and error
reporting problems in all the bindings (Perl and PHP in particular),
and make them feature complete so you can do anything in Xapian with
them.  People are quite likely to use the bindings, rather than C++,
for a search engine, and the incompleteness of the bindings would have
put me off if I'd had a choice.  

And if you lot weren't incredibly responsive with patches ;)

Also this one:

k.  Make it easier to do date sorting and meta-variable weighting.
Things like adding a date weight which decays as it goes back in time.
And some of the other experimental stuff which is in the code.  By "make
it easier", I mean finish off what there is to do this, document it
well in a manual (you probably need a proper manual), and make sure
it works from all the language bindings.
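
As an illustration of a date weight that decays going back in time,
here is a small Python sketch.  The exponential half-life form, the
30-day default and the function names are all assumptions chosen for
the example; none of this is existing Xapian API.

```python
def date_decayed_score(relevance, age_days, half_life_days=30.0):
    """Scale a relevance score down as the document gets older.

    The score halves every half_life_days (an assumed, tunable value).
    """
    return relevance * 0.5 ** (age_days / half_life_days)


def rank(results, half_life_days=30.0):
    """Sort (doc_id, relevance, age_days) tuples by decayed score,
    best first."""
    return sorted(
        results,
        key=lambda r: date_decayed_score(r[1], r[2], half_life_days),
        reverse=True,
    )


# A strong but 90-day-old match against a weaker match from today.
print(rank([("old", 10.0, 90), ("new", 6.0, 0)]))
```

Here the recent document outranks the older one despite its lower raw
relevance; a longer half-life would shift the balance back towards
relevance.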
 
> 2. Where would be good projects/places to get Xapian accepted as a search
> engine? Obviously the more people using Xapian the better, as it drives
> improvements, finds bugs etc.
> 
> a. Content management systems (CMS), e.g. Zope (has anyone tried this?)
> APLAWS (a Redhat-based local authority CMS)
> b. Linux distributions
> c. Academic institutions, many of which can't afford commercial engines and
> usually end up using Google site search or htDig.
> d. Web developers and other organisations that regularly use open source
> software but may not know about Xapian.

You could try talking to the Gnome people doing desktop search stuff.

I believe you're making Debian packages, which would certainly help
uptake.  I'm not sure how the BSDs work, but it shouldn't be that hard
to get into their ports systems?

In the political-hackers world, I'll certainly mention Xapian to
anyone doing projects (for example, to consider it if appropriate for
various www.mysociety.org projects).

> 3. We're also thinking of offering various levels of commercial support for
> Xapian, from the 'pay a small flat fee and we'll get it up and running' to
> full 24-hour support. Does anyone have any comments about this? It might
> help to get Xapian accepted in commercial organisations that need some kind
> of 'formal' backup.

Unfortunately I'm not likely to use it for commercial purposes, at
least not just now.

Maybe you could make a Xapian-in-a-box product to compete with
Google-in-a-box -- then the various PDF searching features and the
like which you mention make sense.

Francis