[Xapian-discuss] omindex hangs while scanning (update)

Olly Betts olly at survex.com
Mon Jul 6 04:40:08 BST 2009


There's no need to cc: me on mailing list replies.

On Sun, Jul 05, 2009 at 11:57:56PM +0200, Elke + Rolf Koehling wrote:
> My problem comes from running omindex.exe under Windows when the HTML
> files have the Windows line terminator (CR/LF). After converting them to
> the Unix line terminator (LF), the software ran fine. The bug is in the
> file loadfile.cc:
> 
>     while (n) {
>         int c = read(fd, blk, min(n, sizeof(blk)));
>         cout << "### read " << c << endl;
>         if (c < 0) {
> 
> The problem is the read() call, which for my 592-byte file with Windows
> line endings (CR/LF) returns only 570 bytes. I have changed the code to
> "if (c <= 0)" and it works for me.

Ah, so you're using Cygwin's automatic end of line translation mode,
which means there's less data to read than stat() reports?

This code is also broken if the file gets truncated by another process
between us reading the size and reading the data, and your fix is about
right (we shouldn't check for EINTR in the EOF case, but IIRC you don't
get EINTR on MS Windows).  This is what I've committed:

http://trac.xapian.org/changeset/12995/trunk/xapian-applications/omega/loadfile.cc
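
For illustration, the general shape of such a loop is sketched below.  This
is not the committed code: the helper name load_up_to() and its signature
are made up for this example, and it assumes a POSIX-style read().  The
point is that a return value of 0 (EOF) ends the loop even if fewer bytes
arrived than the size reported earlier, while EINTR is only considered when
read() actually fails.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper (not the code in loadfile.cc): read up to n bytes
// from fd and append them to out.  Returns false only on a real read error.
bool load_up_to(int fd, std::size_t n, std::string& out) {
    char blk[4096];
    while (n) {
        ssize_t c = read(fd, blk, std::min(n, sizeof(blk)));
        if (c < 0) {
            // Retry only when the call was interrupted by a signal.
            if (errno == EINTR) continue;
            return false;
        }
        if (c == 0) {
            // EOF before n bytes arrived: the file is shorter than the
            // reported size (end of line translation, or truncation by
            // another process).  Stop instead of looping forever.
            break;
        }
        out.append(blk, static_cast<std::size_t>(c));
        n -= static_cast<std::size_t>(c);
    }
    return true;
}
```

With a file that stat() reports as 592 bytes but which yields only 570, the
loop appends the 570 bytes and then stops at the EOF.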

> After this was done I encountered another problem with my Windows file
> names, which start with "D:" and have " " (blanks) in their file or
> directory names. After looking at the code I traced this down to
> shell_protect(), where I added the following patch (note the ' ' and the
> ':'). You should check if this makes sense to you:
> 
>     if (!isalnum(ch) && strchr("/._- :", ch) == NULL) {

No, that's not the right fix.  The problem here is that we're doing
Unix-style shell escaping, which is wrong for MS Windows.  

Here's my attempt at a better fix, but I'm not able to test this:

http://oligarchy.co.uk/xapian/patches/omindex-windows-shell-protect.patch

Let me know if that helps or not, and I'll commit it if it does.
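
To show the kind of quoting I mean, here is a rough sketch.  It is not the
contents of the patch above: quote_windows_arg() is a made-up name, and it
assumes the called program parses its command line by the Microsoft C
runtime rules (backslashes are literal unless they precede a double quote).
It doesn't try to handle characters which cmd.exe itself treats specially
even inside double quotes, such as '%'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: wrap one argument in double quotes so that a
// program using the Microsoft C runtime's argv parsing sees it unchanged.
std::string quote_windows_arg(const std::string& arg) {
    std::string result = "\"";
    std::size_t backslashes = 0;  // Backslashes seen but not yet emitted.
    for (char ch : arg) {
        if (ch == '\\') {
            ++backslashes;
        } else if (ch == '"') {
            // Double the pending backslashes and add one more to escape
            // the quote itself.
            result.append(backslashes * 2 + 1, '\\');
            result += '"';
            backslashes = 0;
        } else {
            // Backslashes not followed by a quote are literal.
            result.append(backslashes, '\\');
            result += ch;
            backslashes = 0;
        }
    }
    // Backslashes just before the closing quote must be doubled, or they
    // would escape it.
    result.append(backslashes * 2, '\\');
    result += '"';
    return result;
}
```

A filename with a drive letter and a space then comes through as a single
argument, with no need to add ' ' or ':' to the list of safe characters.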

> The next one came when I tried to index a Word document, which is done by
> an external program, and the output was like:
> 
>     $ antiword.exe -mUTF-8.txt D:/develop/apache22/htdocs/book/word 01.doc
>     ::::::::::::::
>     D:/develop/apache22/htdocs/book/word
>     ::::::::::::::
>     ::::::::::::::
>     01.doc
>     ::::::::::::::
>     I can't open 'D:/develop/apache22/htdocs/book/word' for reading
>     I can't open '01.doc' for reading
> 
> I've seen this before, so I changed the code to add quotes around the
> external program calls, which then worked fine.
> 
>     string cmd = "antiword -mUTF-8.txt \"" + shell_protect(file) +"\"";

This should be addressed by the patch above.

> As I was already working with the code I have added a very small (but
> for me very handy) feature to skip some special directories while
> scanning. In addition to the documentation in the directory itself we
> have older versions in directories named 'archiv' which I did not want to
> index. So I added the command-line option "-x" and at line 639 the
> following code:
>             if( url.compare( skipdir ) == 0 )
>             {
>                 cout << "skipping dir " << file << endl;
>                 continue;
>             }

That's not going to work though, is it?  You want to check the leafname,
not the URL...

I can see this being generally useful (e.g. ignoring CVS directories),
though perhaps specifying multiple directories should work.  But that
could be added later, since you would just accept the option multiple
times to allow this.
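
Something along these lines is what I have in mind for the leafname check.
It's only a sketch: leafname() and should_skip_dir() are made-up names
rather than functions in omindex, and it assumes '/'-separated paths with
no trailing slash.  Using a set means that accepting the option several
times just adds more names to skip.

```cpp
#include <set>
#include <string>

// Hypothetical helper: return the last component of a '/'-separated path.
std::string leafname(const std::string& path) {
    std::string::size_type slash = path.find_last_of('/');
    if (slash == std::string::npos) return path;
    return path.substr(slash + 1);
}

// Hypothetical helper: true if the directory's own name (not its full
// path or URL) is one of the names the user asked to skip.
bool should_skip_dir(const std::string& path,
                     const std::set<std::string>& skip_names) {
    return skip_names.count(leafname(path)) != 0;
}
```

Comparing the whole URL against "archiv" only matches a directory whose
URL is exactly that string, whereas this matches any directory with that
name, however deep it is.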

If you want to submit a patch, please create it with "diff -u" or "svn
diff" rather than trying to describe the changes in English - this makes
it much easier to duplicate the changes you've actually tested and to
automatically apply them to a newer version of the sources.  Also,
please try to follow the coding conventions of existing code.

There is more advice here:

http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L958

> The last problem I now have to solve is with catdoc, which always fails
> to find the file ascii.spc.  Do you have any hints on how to set up
> the program under Windows, or for some other program?

I don't, but try pasting the error message you get into Google and see
if you can find someone else with the same problem who found a fix.

Cheers,
    Olly
