[Xapian-discuss] Using Xapian for webserver logs...?

Wed May 17 11:15:24 BST 2006

On Wed, May 17, 2006 at 06:40:57PM +1000, John Pye wrote:

>     * sparsity of the data: not all hits have an article type or
>       username or topic tag. Normalising the database fully meant many
>       tables and compex SQL queries.

Yes, you'll need some sort of star schema I'd guess. The SQL shouldn't
be too complex, and I suspect if you'd based this on an enterprise
RDBMS the supplied reporting/BI/DM framework would probably have done
most of your work for you. OTOH these things aren't cheap :-/

>     * rigid table structure: adding new metadata fields requires
>       changing the database structure, which means hard work
>     * size of the database and speed of queries

Your database is going to get pretty big; there's not really a way of
avoiding that. I suspect (without any evidence whatsoever :-) that the
Xapian overhead shouldn't be all that different from the index
overhead in a well-designed relational schema for this, since they're
doing somewhat similar things. The difference is that where you have
to map dimensions onto tables in a star schema, you have to map
dimensions onto term prefixes. MySQL may not be able to do decent
compression in its tables in the way Xapian uses dictionaries and so
forth, so there may be a sizable gap there (although there are ways
round it on the MySQL side).

>     * designing an interface that allowed users to flexibly perform the
>       type of query they wanted was difficult. A google-style search box
>       with some simple syntax would be nicer -- something that could
>       probably be easy with Xapian, I thought.

What were you thinking of? The problem here is that you want people
to, say, be able to ask for "any failed request in the /products
hierarchy on May 13th". Date restrict is a distinct UI widget
(probably), and a relational system will be able to do this better
than Xapian (although Xapian can take a swing at it if you use an
omega-style approach); I suppose the "failed request" could be a
response code filter, but I don't know how intuitive that would
be. Depends on your audience, I guess.

>     * it's hard to add new metadata fields on the fly.

But with Xapian you'd have to reindex everything, wouldn't you? Or
you could just do it for new data, I suppose... even so, I can't
imagine it taking longer than a day to write something that generates
SQL DDL statements on the fly to support your new metadata
requirements as you specify them.

>     * aggregate queries were super slow; perhaps an index with knowledge
>       of the 'importance' of rare keywords would be able to improve
>       speed on this.

I doubt it, but I could be wrong. It's more likely that you're not
throwing enough indexes at the problem (or the right indexes, in
MySQL's case perhaps). With Xapian you're effectively building
semi-efficient indexes of everything, while with an RDBMS you're
building highly-efficient indexes of particular subsets of
everything. It's easy to miss corner cases in complex ad hoc queries
(particularly in MySQL, whose optimiser is ... entertaining).

The only bit Xapian can help you with here is some of the
optimisations that stop looking for new results once the MSet is full,
I think.

> The idea is that all this should be done with 'tags' and a flat format,
> keeping away from all the intricacy and complexity of designing
> efficiently with an RDBMS.

What you're trading is your hard work on getting the RDBMS to do
(probably) the most efficient job possible against work already done
in Xapian which won't do quite as efficient a job because it isn't
optimised for your particular situation. However Xapian will probably
do a better job than an *inefficient* RDBMS implementation will ever
manage, and it may be very difficult to get an efficient RDBMS
implementation done cheaply.

I'd probably recommend you to try it. Shouldn't take more than a
couple of days to get a Xapian testbed up and running for this data.

> Again, I wonder if you have any thoughts on how a system like the
> widely-advertised Splunk might be performing its indexing?

It seems to have its own index. I'm guessing it's effectively
relationship inside, based on some of their diagrams; whether it's
using something off the shelf, such as sqlite or HSQL or something, or
they've created their own system, I have no idea... again, the data
modelling is probably the place where they've really put the hard work
in.i

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org