[Xapian-discuss] tiff / image pdf filter

Sun Mar 22 22:17:09 GMT 2009

On Fri, Mar 20, 2009 at 01:41:18PM +1030, Frank J Bruzzaniti wrote:
> Unless you proofread the documents I guess there is a good chance there 
> will be some garbage data, would you say that it'd be detrimental to the 
> index and should be left out?

If it's junk which doesn't contain terms users will actually search for,
then it'll mostly just uselessly increase the size of the database.
That'll slow indexing down a bit, and possibly search too, though likely
less than indexing.  But it won't affect search results other than
rather indirectly because the document length will be increased.
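
As a rough illustration of that document length effect, here is a minimal
sketch of a textbook BM25 term weight (BM25 is Xapian's default weighting
scheme, but this is not Xapian's exact implementation). The function name,
the parameter values and the example numbers are all made up for
illustration.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                     k1=1.2, b=0.75):
    """Simplified textbook BM25 weight of one term in one document.

    k1 and b are common textbook values, assumed here for illustration.
    """
    # Rarer terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Longer-than-average documents get a larger normalisation factor.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

# Same document, same real term count; only the length differs because
# OCR junk added 300 extra (hypothetical) terms.
clean = bm25_term_weight(tf=3, doc_len=500, avg_doc_len=500,
                         n_docs=10000, doc_freq=50)
noisy = bm25_term_weight(tf=3, doc_len=800, avg_doc_len=500,
                         n_docs=10000, doc_freq=50)

print(clean > noisy)
```

The junk-padded document scores a little lower for the same real term,
which is the indirect effect: nothing matches wrongly, but the ranking
shifts slightly.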

If the junk happens to contain terms which users actually search for
it can result in false matches, which is bad - it makes the results less
useful, and it's confusing to get a result which doesn't seem to match
what you searched for.

The problem with gocr seemed to be that it didn't have a threshold
below which it gave up trying to interpret something as text (or it
had one which defaulted rather low and wasn't documented where I
looked for it), so you got a lot of garbage for an image, which makes
it more likely that "real" terms would be generated.
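
If the OCR tool won't give up by itself, one option is to filter its
output before indexing. The following is only a sketch of a crude
token-level heuristic, not a gocr feature and not a real confidence
threshold: the function names and the cut-off values are assumptions
chosen for illustration, and would need tuning on real data.

```python
import re
import string

def looks_like_word(token, min_len=2, max_len=25, min_alpha_ratio=0.8):
    """Guess whether an OCR token is plausibly a real word.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if not (min_len <= len(token) <= max_len):
        return False
    # Mostly-symbol tokens are typical of an image misread as text.
    alpha = sum(ch.isalpha() for ch in token)
    if alpha / len(token) < min_alpha_ratio:
        return False
    # Reject runs of four or more identical characters (e.g. "iiiii").
    if re.search(r"(.)\1{3,}", token):
        return False
    return True

def filter_ocr_text(text):
    """Return the tokens from OCR output that pass the heuristic."""
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if looks_like_word(token):
            kept.append(token)
    return kept

print(filter_ocr_text("modern l|l|l ~~~~ retrieval, x_/\\_ iiiii"))
```

This only removes junk that doesn't look like a word. It can't help when
the garbage happens to form a real term, and it also drops legitimate
one-letter words and numbers, so it reduces index bloat rather than
preventing false matches.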

The more usual OCR errors (modern being read as modem, etc) also harm
retrieval effectiveness a bit of course, but seem more excusable than
trying to interpret everything on the page as text.

Cheers,
    Olly