[Xapian-discuss] omindex hangs while scanning (update)

Rolf Koehling er-koehling at gmx.de
Tue Aug 4 21:51:02 BST 2009


Hi Olly,

I have been busy the last few days, but I finally applied both of your
patches and the new version works just fine. Many thanks for your help.

Kind regards
Rolf Köhling.

P.S. As this discussion is archived on your mailing list, could you please
    be so kind as to remove the "Elke" from the messages? Many thanks in
    advance.

Olly Betts wrote:
> There's no need to cc: me on mailing list replies.
>
> On Sun, Jul 05, 2009 at 11:57:56PM +0200, Elke + Rolf Koehling wrote:
>   
>> My problem comes from running omindex.exe under Windows when the HTML
>> files have the Windows line terminator (CR/LF). After converting them to
>> the Unix line terminator (LF), the software ran fine. The bug is in the
>> file loadfile.cc:
>>
>>     while (n) {
>>         int c = read(fd, blk, min(n, sizeof(blk)));
>>         cout << "### read " << c << endl;
>>         if (c < 0) {
>>
>> The problem is the read() call, which for my 592-byte file with Windows
>> line endings (CR/LF) returns only 570 bytes. I have changed the code to
>> "if (c <= 0)" and it works for me.
>>     
>
> Ah, so you're using Cygwin's automatic end of line translation mode,
> which means there's less data to read than stat() reports?
>
> This code is also broken if the file gets truncated by another process
> between us reading the size and reading the data, and your fix is about
> right (we shouldn't check for EINTR in the EOF case, but IIRC you don't
> get EINTR on MS Windows).  This is what I've committed:
>
> http://trac.xapian.org/changeset/12995/trunk/xapian-applications/omega/loadfile.cc
>
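A minimal sketch of a read loop along the lines discussed above. It is not
the committed change itself (see the changeset link for that). It assumes a
POSIX-style read(); the helper name read_up_to and its parameters are made
up for illustration. A return of 0 is treated as end of file, so a file that
yields fewer bytes than stat() reported (text-mode translation, or
truncation by another process) ends the loop instead of spinning, and EINTR
is retried only for a negative return.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper, not the actual omindex code: append up to `expected`
// bytes from fd to `out`, stopping early (without error) at end of file.
bool read_up_to(int fd, std::size_t expected, std::string& out) {
    char blk[4096];
    std::size_t n = expected;
    while (n > 0) {
        std::size_t want = n < sizeof(blk) ? n : sizeof(blk);
        ssize_t c = read(fd, blk, want);
        if (c < 0) {
            if (errno == EINTR) continue;  // interrupted: just retry
            return false;                  // genuine read error
        }
        if (c == 0) break;  // EOF before `expected` bytes: stop quietly
        out.append(blk, static_cast<std::size_t>(c));
        n -= static_cast<std::size_t>(c);
    }
    return true;
}
```

With a file that stat() reports as 592 bytes but that delivers only 570,
this loop would return true with 570 bytes in `out`.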
>   
>> After this was done I encountered another problem with my Windows file
>> names, which start with "D:" and have " " (blanks) in their file or
>> directory names. After looking at the code I traced this down to
>> shell_protect(), where I added the following patch (note the ' ' and
>> the ':'). You should check whether this makes sense to you:
>>
>>     if (!isalnum(ch) && strchr("/._- :", ch) == NULL) {
>>     
>
> No, that's not the right fix.  The problem here is that we're doing
> Unix-style shell escaping, which is wrong for MS Windows.  
>
> Here's my attempt at a better fix, but I'm not able to test this:
>
> http://oligarchy.co.uk/xapian/patches/omindex-windows-shell-protect.patch
>
> Let me know if that helps or not, and I'll commit it if it does.
>
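For illustration only, here is a sketch of double-quote based argument
quoting in the style that the Microsoft C runtime's argument parsing
expects. It is not the contents of the patch linked above, and the function
name quote_windows_arg is hypothetical. It does not deal with cmd.exe
metacharacters such as % or ^, which would need separate handling when the
command goes through a shell.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: quote one argument so that a program using the
// Microsoft C runtime's command-line parsing sees it as a single argument.
std::string quote_windows_arg(const std::string& arg) {
    // Nothing to do if there is no whitespace or quote in the argument.
    if (!arg.empty() && arg.find_first_of(" \t\n\v\"") == std::string::npos)
        return arg;

    std::string out = "\"";
    std::size_t i = 0;
    while (i < arg.size()) {
        // Count a run of backslashes; their meaning depends on what follows.
        std::size_t backslashes = 0;
        while (i < arg.size() && arg[i] == '\\') {
            ++backslashes;
            ++i;
        }
        if (i == arg.size()) {
            // Run precedes the closing quote: double it.
            out.append(backslashes * 2, '\\');
        } else if (arg[i] == '"') {
            // Run precedes a literal quote: double it and escape the quote.
            out.append(backslashes * 2 + 1, '\\');
            out += '"';
            ++i;
        } else {
            // Backslashes elsewhere are literal.
            out.append(backslashes, '\\');
            out += arg[i];
            ++i;
        }
    }
    out += '"';
    return out;
}
```

Under these assumptions, a command line could be built as
"antiword -mUTF-8.txt " + quote_windows_arg(file), so that a path such as
D:/develop/apache22/htdocs/book/word 01.doc stays a single argument.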
>   
>> The next one came when I tried to index a Word document, which is done
>> by some external program, and the output was like:
>>
>>     $ antiword.exe -mUTF-8.txt D:/develop/apache22/htdocs/book/word 01.doc
>>     ::::::::::::::
>>     D:/develop/apache22/htdocs/book/word
>>     ::::::::::::::
>>     ::::::::::::::
>>     01.doc
>>     ::::::::::::::
>>     I can't open 'D:/develop/apache22/htdocs/book/word' for reading
>>     I can't open '01.doc' for reading
>>
>> I've seen this before, so I changed the code to add quotes around the
>> external program calls, which then worked fine.
>>
>>     string cmd = "antiword -mUTF-8.txt \"" + shell_protect(file) +"\"";
>>     
>
> This should be addressed by the patch above.
>
>   
>> As I was already working with the code, I have added a very small (but
>> for me very handy) feature to skip some special directories while
>> scanning. In addition to the documentation in the directory itself, we
>> have older versions in directories named 'archiv' which I did not want
>> to index. So I added the command-line option "-x" and, at line 639, the
>> following code:
>>             if( url.compare( skipdir ) == 0 )
>>             {
>>                 cout << "skipping dir " << file << endl;
>>                 continue;
>>             }
>>     
>
> That's not going to work though, is it?  You want to check the leafname,
> not the URL...
>
> I can see this being generally useful (e.g. ignoring CVS directories)
> though perhaps specifying multiple directories should work.  But that
> could be added later as you would just accept multiple options to
> allow this.
>
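As a rough sketch of the leafname check suggested above, and not code from
omindex: the names leafname() and should_skip_dir() are hypothetical, and
the set of names stands in for whatever repeated "-x" options would collect.
Paths are assumed to use '/' as the separator.

```cpp
#include <set>
#include <string>

// Hypothetical helper: return the last component of a '/'-separated path,
// ignoring any trailing slashes.
std::string leafname(const std::string& path) {
    std::string::size_type end = path.find_last_not_of('/');
    if (end == std::string::npos) return std::string();  // empty or all '/'
    std::string::size_type slash = path.find_last_of('/', end);
    std::string::size_type start =
        (slash == std::string::npos) ? 0 : slash + 1;
    return path.substr(start, end - start + 1);
}

// Hypothetical helper: true if the directory's leafname is one of the
// names the user asked to skip (for example via repeated -x options).
bool should_skip_dir(const std::string& path,
                     const std::set<std::string>& skip_names) {
    return skip_names.count(leafname(path)) != 0;
}
```

Comparing only the leafname means a directory named archiv is skipped
wherever it appears in the tree, while a directory such as archiv-notes is
still indexed.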
> If you want to submit a patch, please create it with "diff -u" or "svn
> diff" rather than trying to describe the changes in English - this makes
> it much easier to duplicate the changes you've actually tested and to
> automatically apply them to a newer version of the sources.  Also,
> please try to follow the coding conventions of existing code.
>
> There is more advice here:
>
> http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L958
>
>   
>> The last problem I now have to solve is with catdoc, where the software
>> always has trouble finding the file ascii.spc. Do you have any hints on
>> how to set up the program under Windows, or for some other program?
>>     
>
> I don't, but try pasting the error message you get into Google and see
> if you can find someone else with the same problem who found a fix.
>
> Cheers,
>     Olly
>
>   


