[Xapian-discuss] Omindex Filters

Thu Sep 18 22:36:04 BST 2008

Olly Betts writes:
 > I had a crude stab at measuring the overhead - I took the rcldoc filter
 > from Recoll 1.3.3 (just because I happen to have that source tree
 > unpacked already) and timed it unpacking the same 300KB word document
 > (a random scientific paper) 800 times like so:
 > 
 > [numbers]
 >
 > Conclusion - rcldoc is 42% slower, and I've not factored in the extra
 > time omindex would need to spend parsing the HTML.  Now I understand
 > that doc isn't a trivial to parse format, so I think this crude test
 > is indicative.  Also, I did it on Linux which has a low process start
 > overhead.  On cygwin this would be much worse.
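
For readers who want to repeat this kind of measurement, here is a minimal sketch of the approach described above: run a filter command a fixed number of times and compare total wall-clock time against a baseline. The helper names (time_command, relative_overhead) and the stand-in commands are assumptions for illustration only. They are not the commands used in the quoted test, whose figures are not reproduced here.

```python
import subprocess
import sys
import time


def time_command(argv, runs):
    """Return total wall-clock seconds spent running argv `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        # Output is discarded: only process start-up plus conversion time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def relative_overhead(baseline_seconds, candidate_seconds):
    """Return how much slower the candidate is than the baseline, in percent."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds * 100.0


if __name__ == "__main__":
    # Hypothetical stand-ins: replace with the direct converter and the
    # wrapper-script filter you actually want to compare.
    baseline_cmd = [sys.executable, "-c", "pass"]
    candidate_cmd = [sys.executable, "-c", "import json"]
    runs = 20
    base = time_command(baseline_cmd, runs)
    cand = time_command(candidate_cmd, runs)
    print(f"overhead: {relative_overhead(base, cand):.1f}%")
```

Because each run spawns a new process, the result includes process start-up cost, which is the part expected to differ most between platforms.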

I can't argue with this, obviously. Even though I'd bet that I can
probably get the number down to 20%, there will always be an overhead
(which, by the way, is mainly a problem if indexing is CPU-bound). The
balance of issues is different for omindex, which works on huge datasets,
and Recoll, which is a personal tool. For Recoll I've found that having
wrapper scripts and a pivot format was worth the additional load because:
 - I want to have a way to transport character set (and other) information.
 - I find it convenient to have an isolation/diagnostic layer between the C
   code and the translators.
 - I have never been told that Recoll indexing slowness was a problem.
 - I like scripting in weird languages :)

 > I'm not trying to knock Recoll (or Estraier which seems to be where
 > these filters originated) here, just pointing out why I have
 > reservations about the approach.

Understood, especially given the stricter performance constraints on omindex.

Cheers,
jf