[Xapian-discuss] Swish3

Olly Betts olly at survex.com
Mon Mar 2 03:09:51 GMT 2009


On Fri, Feb 27, 2009 at 10:14:16PM -0600, Peter Karman wrote:
> Olly Betts wrote on 2/27/09 12:00 AM:
> > On Thu, Feb 26, 2009 at 08:25:43AM -0600, Peter Karman wrote:
> >> Swish3 is similar to Omega, but as I have outlined[3] on the
> >> Swish-e list, it offers some different features, notably robust XML
> >> parsing using libxml2.
> > 
> > Are you implying that Omega's XML parser is not robust?
> > 
> > I don't recall seeing any bug reports about this...
> 
> on the contrary. I did not intend to disparage Omega. I actually based the
> swish_xapian example program on omindex.[0]
> 
> What I was trying to say instead was that Swish3 is oriented toward XML
> explicitly, uses libxml2, and is DOM-aware. You seemed to suggest omindex's
> XmlParser "just strips all the tags"[1].

Ah, I see the confusion.  There's a (perhaps poorly named) HtmlParser
subclass in the omindex source code called XmlParser which does
essentially just strip the tags (since that's what's needed to extract
text from OpenDocument and some other XML formats), but there are other
subclasses for parsing other XML formats.

The rather odd hierarchy here (XML being derived from HTML is backwards)
is really just a historical relic.

> Take this example. omindex must be told to recognize the .xml file extension:
> 
> [karpet at pekmac:~/projects/search_bench]$ omindex --db omega.db test
> omindex: --url not specified, assuming `/'.
> [Entering directory /]
> Unknown extension: "test/swish.xml" - skipping

Yes, omindex wants to know how to extract "text" from a format, and
optionally certain metadata.  There isn't a strategy for doing so that
applies universally to "XML" (since it's a framework for defining
formats rather than a single format).  Just stripping all the tags can
work, but it can also either find no text at all, or find stuff that
isn't text.
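
To illustrate the three outcomes, here is a minimal Python sketch.  It is
not omindex's code: strip_tags() and the three sample documents are
hypothetical, and the standard library's xml.etree.ElementTree stands in
for whatever parser is really used.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_source: str) -> str:
    """Return the character data of an XML document with markup discarded.

    Hypothetical helper for illustration only; not omindex's XmlParser.
    """
    root = ET.fromstring(xml_source)
    # itertext() yields every text node in document order; attributes
    # are never included.  The split/join collapses runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


# Case 1: the prose lives in element content, so stripping tags works.
prose = "<doc><title>Swish3</title><p>Indexes XML with libxml2.</p></doc>"

# Case 2: everything lives in attributes, so no text is found.
config = '<settings><option name="cache" value="on"/></settings>'

# Case 3: element content is data rather than prose, so what comes
# out is not useful "text".
data = "<image><width>640</width><checksum>9f3ab2</checksum></image>"

if __name__ == "__main__":
    for label, source in (("prose", prose), ("config", config), ("data", data)):
        print(f"{label}: {strip_tags(source)!r}")
```

Only the first document yields something worth indexing, which is why a
per-format extraction rule is needed rather than one rule for "XML".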
    
Most XML formats don't use the extension .xml but have their own
standard extension - omindex can index several such XML formats out of
the box.

Looking on my file system, most of the ".xml" files seem to be config
files, which probably aren't useful to try to index.

> So I'm sure the Omega XML parser is robust. Swish3 just has a different
> orientation and uses libxml2.

Sure - it sounds like the XML support is a lot fancier.  I was really
just objecting to you saying that "robust XML parsing" was a "different
feature" it had compared to Omega.

Cheers,
    Olly


