[Xapian-discuss] omindex hangs while scanning (update)
Elke + Rolf Koehling
er-koehling at gmx.de
Sun Jul 5 22:57:56 BST 2009
Hi Olly,
please excuse the delay since your e-mail providing a patch with
more output.
Over the last two days I had some time to really track down my problems:
-------------------------------------------------------------------------------------------
My problem occurs when running omindex.exe under Windows on HTML files
that have the Windows line terminator (CR/LF). After converting them to
the Unix line terminator (LF) the software ran fine. The bug is in the
file loadfile.cc:
while (n) {
int c = read(fd, blk, min(n, sizeof(blk)));
cout << "### read " << c << endl;
if (c < 0) {
The problem is the read() call, which for my file of 592 bytes with
Windows line endings (CR/LF) returns only 570 bytes. The next read()
then returns 0 while n is still greater than zero, so the loop never
ends. I have changed the code to "if (c <= 0)" and it works for me.
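
A minimal sketch of a read loop that treats a return value of 0 as end
of file, so a short file cannot make it spin. This is not the actual
loadfile.cc code: the helper name read_up_to and the 4096-byte buffer
are assumptions, and it uses POSIX read() from <unistd.h>. One plausible
cause of the 592 vs. 570 mismatch, not confirmed in this thread, is that
the descriptor is in text mode, where the Windows C runtime translates
CR/LF to LF; opening the file in binary mode would then be an
alternative fix.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper (not from loadfile.cc): append up to n bytes read
// from fd to out. Returns false only on a genuine read error.
bool read_up_to(int fd, std::size_t n, std::string& out) {
    char blk[4096];
    while (n > 0) {
        ssize_t c = read(fd, blk, std::min(n, sizeof(blk)));
        if (c < 0) {
            if (errno == EINTR) continue;  // interrupted: retry
            return false;                  // real error
        }
        if (c == 0) break;  // EOF before n bytes arrived: stop looping
        out.append(blk, static_cast<std::size_t>(c));
        n -= static_cast<std::size_t>(c);
    }
    return true;
}
```

With this shape, asking for 592 bytes from a source that only delivers
570 returns the 570 bytes and terminates instead of hanging.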
-------------------------------------------------------------------------------------------
After this was done I encountered another problem with my Windows file
names, which start with "D:" and have " " (blanks) in their file or
directory names. Looking at the code, I traced this down to
shell_protect(), where I added the following patch (note the ' ' and
the ':'). You should check whether this makes sense to you:
if (!isalnum(ch) && strchr("/._- :", ch) == NULL) {
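
To illustrate what the extra ' ' and ':' change, here is a small sketch
of an allow-list escaper. It is only a stand-in: the name protect_sketch
and the backslash escaping are assumptions, not the real shell_protect()
implementation. Note that a blank passed through unescaped is only safe
if the whole argument is quoted afterwards.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical stand-in for a shell_protect()-style helper: every
// character that is neither alphanumeric nor in `allowed` gets a
// backslash in front of it.
std::string protect_sketch(const std::string& file, const char* allowed) {
    std::string out;
    for (char raw : file) {
        unsigned char ch = static_cast<unsigned char>(raw);
        if (!std::isalnum(ch) && std::strchr(allowed, ch) == nullptr) {
            out += '\\';
        }
        out += raw;
    }
    return out;
}
```

With the allow list "/._-" the path "D:/my docs/a.doc" becomes
"D\:/my\ docs/a.doc"; with "/._- :" it is returned unchanged.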
-------------------------------------------------------------------------------------------
The next one came when I tried to index a Word document, which is done
by an external program. The output looked like this:
$ antiword.exe -mUTF-8.txt D:/develop/apache22/htdocs/book/word 01.doc
::::::::::::::
D:/develop/apache22/htdocs/book/word
::::::::::::::
::::::::::::::
01.doc
::::::::::::::
I can't open 'D:/develop/apache22/htdocs/book/word' for reading
I can't open '01.doc' for reading
I've seen this before, so I changed the code to add quotes around the
external program calls, which then worked fine:
string cmd = "antiword -mUTF-8.txt \"" + shell_protect(file) +"\"";
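
The quoting can be factored into a tiny helper, sketched here under the
assumption that the path itself contains no double quote (Windows file
names cannot). The names quote_arg and antiword_command are made up for
this illustration and are not part of omindex.

```cpp
#include <string>

// Hypothetical helper: wrap a path in double quotes so the command
// interpreter hands it to the filter as a single argument.
std::string quote_arg(const std::string& path) {
    return "\"" + path + "\"";
}

// Build the antiword command line for one file (sketch only).
std::string antiword_command(const std::string& path) {
    return "antiword -mUTF-8.txt " + quote_arg(path);
}
```

For "D:/develop/apache22/htdocs/book/word 01.doc" this yields one quoted
argument instead of the two separate ones seen in the output above.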
-------------------------------------------------------------------------------------------
As I was already working with the code, I have added a very small (but
for me very handy) feature to skip certain directories while scanning.
In addition to the documentation in the directory itself we have older
versions in directories named 'archiv', which I did not want to index.
So I added the command-line option "-x" and, at line 639, the following
code:
if( url.compare( skipdir ) == 0 )
{
cout << "skipping dir " << file << endl;
continue;
}
-------------------------------------------------------------------------------------------
The last problem I still have to solve is with catdoc, which always
fails to find the file ascii.spc. Do you have any hints on how to set
up this program under Windows, or can you suggest some other program?
-------------------------------------------------------------------------------------------
In the end my Xapian / Omega is now up and running, many many thanks
for your help. As the code is very clear and easy to follow, it was not
very hard to track down my problems.
In exchange for your help I hope my input helps in your effort to
further improve this nice software.
Cheers
Rolf Köhling.
Olly Betts wrote:
> On Tue, Jun 23, 2009 at 06:44:31PM +0200, "Elke Köhling" wrote:
>
>> Meanwhile I have played around a bit using scriptindex on oindex.pl
>> of the omega script example. Funny enough all the example files
>> from 'phil' do work on my box and omega with apache works as well.
>> Trying to index the 'book' example it hangs up. I will try to figure
>> out the difference tonight.
>>
>
> It's probably a bug in omindex then, in which case the extra output from
> the patched omindex.cc should be very helpful.
>
> Cheers,
> Olly
>
>