[Xapian-discuss] Using Xapian for webserver logs...?

John Pye john.pye at student.unsw.edu.au
Wed May 17 09:40:57 BST 2006


Hi Michael

Thanks for your thoughts on this. I actually did implement all this with
an RDBMS (MySQL to be specific) and I found some problems with the
approach. Basically it just didn't scale to the number of records that I
needed. Probably this was due to poor database design, but a few of us
have tackled the problem and we're none the wiser so far. Some issues were:

    * sparsity of the data: not all hits have an article type or
      username or topic tag. Normalising the database fully meant many
      tables and complex SQL queries.
    * rigid table structure: adding new metadata fields requires
      changing the database structure, which means hard work
    * size of the database and speed of queries
    * designing an interface that allowed users to flexibly perform the
      type of query they wanted was difficult. A google-style search box
      with some simple syntax would be nicer -- something that could
      probably be easy with Xapian, I thought.
    * it's hard to add new metadata fields on the fly.
    * aggregate queries were super slow; perhaps an index with knowledge
      of the 'importance' of rare keywords would be able to improve
      speed on this.


The idea is that all this should be done with 'tags' and a flat format,
keeping away from all the intricacy and complexity of designing
efficiently with an RDBMS.
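
To make the 'tags' and flat format idea concrete, here is a minimal
sketch in plain Python. It is not Xapian's API and not how the MySQL
version worked; the class name TagIndex, the field names (date, country,
user, article, topic) and the 'field:value' query syntax are all
assumptions made up for illustration. Each hit is a 'document' with a
sparse set of tags, and a query is a boolean AND over tags.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: every hit carries a sparse set of 'field:value' tags."""

    def __init__(self):
        self.hits = []                    # hit id -> dict of the fields it has
        self.postings = defaultdict(set)  # "field:value" -> set of hit ids

    @staticmethod
    def _as_list(value):
        # A field may hold one value or several (e.g. multiple topic tags).
        return list(value) if isinstance(value, (list, tuple, set)) else [value]

    def add_hit(self, **fields):
        """Index one hit; fields that are absent simply produce no tags."""
        hit_id = len(self.hits)
        self.hits.append(fields)
        for field, value in fields.items():
            for v in self._as_list(value):
                self.postings[f"{field}:{v}".lower()].add(hit_id)
        return hit_id

    def search(self, query):
        """AND together whitespace-separated 'field:value' terms.

        Values containing spaces are not supported in this sketch.
        """
        terms = query.lower().split()
        if not terms:
            return []
        # Intersect the smallest posting lists first to keep the work small.
        sets = sorted((self.postings.get(t, set()) for t in terms), key=len)
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return sorted(result)

    def count_by(self, field, query=None):
        """Drill-down: count values of `field` among hits matching `query`."""
        ids = range(len(self.hits)) if query is None else self.search(query)
        counts = defaultdict(int)
        for i in ids:
            if field in self.hits[i]:
                for v in self._as_list(self.hits[i][field]):
                    counts[v] += 1
        return dict(counts)


# Hypothetical hits; note that not every hit has a user, article or topic.
index = TagIndex()
index.add_hit(date="2006-05-17", country="au", user="jpye",
              article="42", topic=["solar", "energy"])
index.add_hit(date="2006-05-17", country="de", article="42")
index.add_hit(date="2006-05-16", country="au", topic=["solar"])

au_today = index.search("country:au date:2006-05-17")
article_hits = index.search("article:42")
countries_for_article = index.count_by("country", "article:42")
```

Adding a new metadata field here needs no schema change: a hit that has
the field gets an extra tag, and a hit that lacks it gets nothing.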

Again, I wonder if you have any thoughts on how a system like the
widely-advertised Splunk might be performing its indexing?

Cheers
JP

Michael Schlenker wrote:
> John Pye wrote:
>   
>> I have another idea for an application of the Xapian indexing system. I
>> think that it's probably not exactly what Xapian is all about, but
>> nevertheless, I wonder if you have any comments or alternative suggestions.
>>
>> The aim is to provide a system that indexes Apache web server logs for a
>> news-style website content management system. We index articles, issues,
>> sections of a set of monthly or weekly publications. Articles have topic
>> tags and we also have information about who (username) is visiting our
>> site, and when and from where.
>>
>> What we want to be able to do is to index the webserver's accesses so
>> that we can do full drill-down and find all hits from people in a
>> particular country on a particular day, or all hits on a particular
>> article, etc.
>>     
>
> Sounds more like you want an RDBMS and do data warehousing/decision
> support type stuff with it.
>
>   
>> I thought that Xapian, particularly using its boolean mode of operation,
>> might be suitable for this type of indexing and querying in a way that
>> perhaps conventional RDBMS are not. Each 'hit' would become a 'document'
>> in Xapian, so there would soon be millions of 'documents' but with
>> relatively few 'keywords' such as username, date, article title, etc. 
>> Would you agree with that thought? If not, would you suggest a different
>> approach, perhaps some more suitable software? I was thinking of Splunk
>> and wondering how they might have implemented their system. Would such
>> indexing and search be feasible with a single shared server?
>>     
>
> You could do things like that with Xapian's API, the main question is
> 'why?'. You seem to not do any meaningful fulltext search.
>
> I would simply parse the logfiles, store the 'dimensions' you're
> interested in into a suitable RDBMS, and then use that to drill down. An
> RDBMS is probably more suitable for this task, but you have to invest
> some time to design proper table structures for the type of questions
> you want answered.
>
> What can be useful is combining Xapian with an RDBMS to index documents
> for fulltext search as an alternative access path to metadata retrieved
> from an RDBMS. Depends on your application. For web server log files I
> don't see it.
>
> Michael
>
>   

-- 
John Pye
School of Mechanical and Manufacturing Engineering
The University of New South Wales
Sydney  NSW 2052  Australia
t +61 2 9385 5127
f +61 2 9663 1222
mailto:john.pye_AT_student_DOT_unsw.edu.au
http://pye.dyndns.org/



