[Xapian-discuss] omindex hangs while scanning (update)

Elke + Rolf Koehling er-koehling at gmx.de
Sun Jul 5 22:57:56 BST 2009


Hi Olly,

please excuse the delay since your e-mail providing a patch with more
output.
Over the last two days I had some time to really track down my problems:

-------------------------------------------------------------------------------------------
My problem occurs when running omindex.exe under Windows on HTML files
that use the Windows line terminator (CR/LF). After converting them to
the Unix line terminator (LF), the software ran fine. The bug is in the
file loadfile.cc:

    while (n) {
        int c = read(fd, blk, min(n, sizeof(blk)));
        cout << "### read " << c << endl;
        if (c < 0) {

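The problem is the read() call: for my 592-byte file with Windows line
endings (CR/LF) it returns only 570 bytes. Presumably the file is opened
in text mode, so each CR/LF pair comes back as a single LF; read() then
returns 0 at end of file, n never reaches zero, and the loop never ends.
I have changed the test to "if (c <= 0)" and it works for me.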
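
A read loop of this shape can also tell end of file and errors apart.
The following is only a sketch, not the actual loadfile.cc code: ReadFn
and load_bytes are hypothetical names, and ReadFn stands in for read() so
that the short-read case can be reproduced without a real file.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for read(): copies up to len bytes into buf and
// returns the count, 0 at end of file, or a negative value on error.
typedef std::function<long(char*, std::size_t)> ReadFn;

// Try to read n bytes (e.g. the size reported for the file).  Stops early
// if the reader reports end of file, so a short file cannot spin forever.
// Returns false only when the reader reports an error.
bool load_bytes(const ReadFn& do_read, std::size_t n, std::string& out) {
    char blk[4096];
    while (n) {
        long c = do_read(blk, std::min(n, sizeof(blk)));
        if (c < 0) return false;  // genuine read error
        if (c == 0) break;        // end of file reached before n bytes
        out.append(blk, static_cast<std::size_t>(c));
        n -= static_cast<std::size_t>(c);
    }
    return true;
}
```

With a reader that delivers only 570 bytes when 592 were requested, the
loop ends at the first zero-length read instead of hanging.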


-------------------------------------------------------------------------------------------
After this was done I ran into another problem with my Windows file
names, which start with "D:" and have " " (blanks) in their file or
directory names. Looking at the code, I traced this down to
shell_protect(), where I added the following patch (note the ' ' and the
':'). You should check whether this makes sense to you:

    if (!isalnum(ch) && strchr("/._- :", ch) == NULL) {
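
To make the effect of the changed character list visible, here is a
simplified sketch of an escaping loop of this kind. It is an assumption
about how shell_protect() behaves, not a copy of the omindex source;
protect_sketch and its safe parameter are made up for illustration.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical simplification: put a backslash in front of every character
// that is neither alphanumeric nor listed in `safe`.
std::string protect_sketch(const std::string& file, const char* safe) {
    std::string out;
    for (std::string::size_type i = 0; i < file.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(file[i]);
        // ch == 0 is checked first because strchr() would otherwise match
        // the terminating NUL of `safe`.
        if (!std::isalnum(ch) && (ch == 0 || std::strchr(safe, ch) == NULL)) {
            out += '\\';
        }
        out += static_cast<char>(ch);
    }
    return out;
}
```

With the list "/._-", the name "D:/my docs/a.doc" has a backslash put in
front of the ':' and the blank; with ' ' and ':' added to the list it
passes through unchanged, so blanks then have to be handled some other
way.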

-------------------------------------------------------------------------------------------
The next one came when I tried to index a Word document, which is done
by an external program. The output was like:

    $ antiword.exe -mUTF-8.txt D:/develop/apache22/htdocs/book/word 01.doc
    ::::::::::::::
    D:/develop/apache22/htdocs/book/word
    ::::::::::::::
    ::::::::::::::
    01.doc
    ::::::::::::::
    I can't open 'D:/develop/apache22/htdocs/book/word' for reading
    I can't open '01.doc' for reading

I've seen this before, so I changed the code to add quotes around the
file name in the external program calls, which then worked fine.

    string cmd = "antiword -mUTF-8.txt \"" + shell_protect(file) +"\"";
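
The quoting step could be factored into a small helper. A sketch only:
quote_arg and build_antiword_cmd are hypothetical names, and it assumes
the file name contains no double quote (which Windows does not allow in
file names anyway).

```cpp
#include <string>

// Hypothetical helper: wrap a file name in double quotes so that the shell
// passes it to the program as one argument even if it contains blanks.
// Assumes `file` itself contains no '"' character.
std::string quote_arg(const std::string& file) {
    return "\"" + file + "\"";
}

// Builds the command line in the same form as the line above
// (shell_protect() is left out to keep the sketch self-contained).
std::string build_antiword_cmd(const std::string& file) {
    return "antiword -mUTF-8.txt " + quote_arg(file);
}
```

For the file from the failing run this gives a single quoted argument, so
antiword no longer sees "word" and "01.doc" as two separate files.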

-------------------------------------------------------------------------------------------
As I was already working with the code, I added a very small - but for
me very handy - feature to skip some special directories while scanning.
In addition to the documentation in the directory itself, we have older
versions in directories named 'archiv' which I did not want to index. So
I added the command-line option "-x" and, at line 639, the following code:

            if( url.compare( skipdir ) == 0 )
            {
                cout << "skipping dir " << file << endl;
                continue;
            }
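
One way to decide whether a directory should be skipped is to compare its
last path component with the name given to "-x". This is a sketch under
that assumption, not the code from omindex.cc; leaf_name and should_skip
are hypothetical helpers.

```cpp
#include <string>

// Hypothetical helper: return the last component of a '/'-separated path,
// ignoring one trailing slash ("book/archiv/" -> "archiv").
std::string leaf_name(std::string path) {
    if (!path.empty() && path[path.size() - 1] == '/') {
        path.erase(path.size() - 1);
    }
    std::string::size_type slash = path.rfind('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// True if the directory's own name equals the name to skip.  An empty
// skipdir (option not given) never matches.
bool should_skip(const std::string& dir, const std::string& skipdir) {
    return !skipdir.empty() && leaf_name(dir) == skipdir;
}
```

Matching on the whole component means that 'archiv' is skipped while a
directory called 'archive' is still indexed.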

-------------------------------------------------------------------------------------------
The last problem I still have to solve is with catdoc, which always
fails to find the file ascii.spc. Do you have any hints on how to set up
the program under Windows, or can you suggest some other program?

-------------------------------------------------------------------------------------------

In the end my Xapian / Omega is now up and running; many, many thanks
for your help. As the code is very clear and easy to follow, it was not
very hard to track down my problems.
In exchange for your help, I hope my input helps in your effort to
further improve this nice software.

Cheers
Rolf Köhling.


Olly Betts wrote:
> On Tue, Jun 23, 2009 at 06:44:31PM +0200, "Elke Köhling" wrote:
>   
>> Meanwhile I have played around a bit using scriptindex on oindex.pl
>> of the omega script example. Funny enough all the example files
>> from 'phil' do work on my box and omega with apache works as well.
>> Trying to index the 'book' example it hangs up. I will try to figure
>> out the difference tonight.
>>     
>
> It's probably a bug in omindex then, in which case the extra output from
> the patched omindex.cc should be very helpful.
>
> Cheers,
>     Olly
>
>   


