peter at peknet.com
Sat Feb 28 04:14:16 GMT 2009
Olly Betts wrote on 2/27/09 12:00 AM:
> On Thu, Feb 26, 2009 at 08:25:43AM -0600, Peter Karman wrote:
>> Swish3 is similar to Omega, but as I have outlined on the Swish-e list, it
>> offers some different features, notably robust XML parsing using libxml2.
> Are you implying that Omega's XML parser is not robust?
> I don't recall seeing any bug reports about this...
On the contrary. I did not intend to disparage Omega. I actually based the
swish_xapian example program on omindex.
What I was trying to say instead was that Swish3 is oriented toward XML
explicitly, uses libxml2, and is DOM-aware. You seemed to suggest omindex's
XmlParser "just strips all the tags". Swish3 keeps track of tags (context)
during parsing and tokenization, and tracks the DOM for every token. You can
also configure "virtual tags" which combine XML attributes and element names in
ad hoc ways.
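To make "tracks the DOM for every token" concrete, here is a rough Python sketch of the idea. It is not how Swish3 is implemented (Swish3 is built on libxml2); it uses the standard library's ElementTree, and the function names and the "type" attribute rule for virtual tags are assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def virtual_tag(elem):
    # Hypothetical rule: fold a "type" attribute into the element name,
    # e.g. <field type="author"> becomes the virtual tag "field:author".
    attr = elem.get("type")
    return f"{elem.tag}:{attr}" if attr else elem.tag


def tokens_with_context(xml_text):
    """Yield (token, context) pairs, where context is the path of enclosing tags."""
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        here = path + [virtual_tag(elem)]
        for tok in re.findall(r"\w+", elem.text or ""):
            yield tok.lower(), "/".join(here)
        for child in elem:
            yield from walk(child, here)
            # Text after a child's closing tag belongs to the parent element.
            for tok in re.findall(r"\w+", child.tail or ""):
                yield tok.lower(), "/".join(here)

    yield from walk(root, [])


if __name__ == "__main__":
    doc = '<doc><title>Hello World</title><field type="author">Peter</field> tail</doc>'
    for token, context in tokens_with_context(doc):
        print(token, context)
```

Each token comes out paired with the element path it was found in, so a later stage can decide how to index it based on that context.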
Take this example. omindex must be told to recognize the .xml file extension:
[karpet at pekmac:~/projects/search_bench]$ omindex --db omega.db test
omindex: --url not specified, assuming `/'.
[Entering directory /]
Unknown extension: "test/swish.xml" - skipping
While the swish_xapian script knows about .xml by default:
[karpet at pekmac:~/projects/search_bench]$ swish_xapian test/swish.xml --index
parse_file for test/swish.xml .... added.
The Swish3 config file format allows you to automatically add Xapian term
prefixes (in Swish lingo, MetaNames) based on DOM context, for both HTML and
XML, and likewise to store arbitrary document content (PropertyNames). You can
do those kinds of things with omindex and scriptindex too. Swish3 just makes it
possible to do it via a config file.
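As a rough illustration of config-driven prefixing, the Python sketch below maps element names to term prefixes and picks which elements to store verbatim. The two tables stand in for a config file and are hypothetical; they are not the Swish3 config syntax. The "S" (title) and "A" (author) prefixes follow the convention Omega documents, and the sketch only builds term strings rather than calling the Xapian bindings.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a config file: element name -> term prefix.
META_PREFIXES = {"title": "S", "author": "A"}
# Hypothetical: elements whose text is stored verbatim with the document.
PROPERTY_TAGS = {"title"}


def index_document(xml_text):
    """Return (terms, properties) for one XML document."""
    root = ET.fromstring(xml_text)
    terms, properties = [], {}
    for elem in root.iter():
        # Only leading text of each element is handled; tail text is ignored here.
        text = (elem.text or "").strip()
        if not text:
            continue
        prefix = META_PREFIXES.get(elem.tag, "")  # unprefixed if not configured
        for tok in re.findall(r"\w+", text.lower()):
            terms.append(prefix + tok)
        if elem.tag in PROPERTY_TAGS:
            properties[elem.tag] = text
    return terms, properties


if __name__ == "__main__":
    doc = ("<doc><title>Swish3 Notes</title><author>Peter</author>"
           "<body>uses libxml2</body></doc>")
    terms, properties = index_document(doc)
    print(terms)
    print(properties)
```

Changing what gets prefixed or stored then means editing the two tables, not the indexing code, which is the point of doing it via a config file.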
So I'm sure the Omega XML parser is robust. Swish3 just has a different
orientation and uses libxml2.
Peter Karman . http://peknet.com/ . peter at peknet.com