[Xapian-discuss] Omindex Filters
Jean-Francois Dockes
jean-francois.dockes at wanadoo.fr
Thu Sep 18 22:36:04 BST 2008
Olly Betts writes:
> I had a crude stab at measuring the overhead - I took the rcldoc filter
> from Recoll 1.3.3 (just because I happen to have that source tree
> unpacked already) and timed it unpacking the same 300KB word document
> (a random scientific paper) 800 times like so:
>
> [numbers]
>
> Conclusion - rcldoc is 42% slower, and I've not factored in the extra
> time omindex would need to spend parsing the HTML. Now I understand
> that doc isn't a trivial format to parse, so I think this crude test
> is indicative. Also, I did it on Linux which has a low process start
> overhead. On cygwin this would be much worse.
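A rough sketch of how such a comparison could be scripted, for anyone who wants to repeat it. This is not the harness Olly used: the function names time_command and percent_slower are made up for illustration, and the commands passed in are whatever two filter invocations you want to compare.

```python
import subprocess
import time


def time_command(cmd, runs):
    """Run cmd (a list of argv strings) `runs` times; return total wall-clock seconds.

    Output is discarded so that terminal I/O does not skew the timing.
    """
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,  # stop early if the filter itself fails
        )
    return time.perf_counter() - start


def percent_slower(baseline_seconds, candidate_seconds):
    """How much slower the candidate is than the baseline, in percent."""
    return 100.0 * (candidate_seconds - baseline_seconds) / baseline_seconds
```

Calling time_command once for the direct converter and once for the wrapper script, with the same document and the same run count, and passing both totals to percent_slower gives a figure of the kind quoted above. Process start cost is included in each run, which is the point of the measurement.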
I can't argue with this, obviously. Even though I'd bet that I could
probably get the number down to 20%, there will always be an overhead
(which, by the way, is mainly a problem if indexing is CPU-bound). The
balance of issues is different for omindex, which works on huge datasets,
and Recoll, which is a personal tool. For Recoll I've found that having
wrapper scripts and a pivot format was worth the additional load because:
- I want to have a way to transport character set (and other) information.
- I find it convenient to have an isolation/diagnostic layer between the C
code and the translators.
- I have never been told that Recoll indexing slowness was a problem.
- I like scripting in weird languages :)
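To make the first two points concrete, here is a minimal sketch of what a wrapper could look like, assuming the pivot format is HTML with the character set declared in a meta tag. It is not the actual rcldoc script: to_pivot_html and run_filter are hypothetical names, and the converter command is whatever external tool does the real work.

```python
import html
import subprocess


def to_pivot_html(raw, charset, title=""):
    """Wrap raw converter output (bytes) in a small HTML document.

    The charset travels with the document in the meta tag, so the indexer
    does not have to guess it. The caller should encode the returned string
    with the same charset when writing it out.
    """
    text = raw.decode(charset, errors="replace")
    return (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        f"<title>{html.escape(title)}</title>"
        "</head><body><pre>"
        f"{html.escape(text)}"
        "</pre></body></html>"
    )


def run_filter(cmd, charset):
    """Run an external converter and return its output as pivot HTML.

    This is the isolation/diagnostic layer: a failing converter becomes a
    readable error carrying its stderr, instead of a problem inside the
    indexer process.
    """
    proc = subprocess.run(cmd, capture_output=True)
    if proc.returncode != 0:
        message = proc.stderr.decode(errors="replace").strip()
        raise RuntimeError(f"{cmd[0]} failed: {message}")
    return to_pivot_html(proc.stdout, charset)
```

The cost is the one discussed above: an extra process per document, plus the HTML parse on the indexer side.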
> I'm not trying to knock Recoll (or Estraier which seems to be where
> these filters originated) here, just pointing out why I have
> reservations about the approach.
Understood, especially given the stricter performance constraints on omindex.
Cheers,
jf