omega issues/notes

Tue Oct 18 14:02:01 BST 2016

On Tue, Sep 27, 2016 at 04:32:33PM -0400, John Bankert wrote:
> I've run into a couple of things using omega/omindex under cygwin. I don't
> think I'd attribute them to xapian, omega or omindex, but wanted to get
> them out to the list so that if anyone else should run into these things
> down the road, hopefully someone will remember and be able to help.
> 
> 1) after compiling and building omega, and doing make install, I get a set
> violation when trying to run omindex from its installed location under
> cygwin. I worked around this by copying various required windows dll files
> into the same directory as omindex.exe and presto, success.

I've no idea what a "set violation" is - is that a typo for "seg violation"
(short for "segmentation violation")?

Not sure I can offer much insight into this though - I haven't had to wrangle
DLLs for several decades.

> 2) There appears to be some sort of weird path issue in using omindex in
> the cygwin bash shell. Using the path /www/example/product should, in cygwin
> bash, act as a fully defined path to the directory to be indexed by omindex.
> This is not the case. I had to produce a relative path from where
> omindex.exe was running in order to successfully index the files in
> /www/example/product.

I tried to set up a build on appveyor to reproduce this, but it works for me:

https://ci.appveyor.com/project/ojwb/xapian/build/1.0.30

In particular:

    bash -c 'xapian-applications/omega/omindex -v --db omtest.db --url msproducts /www/example/products/'
    [Entering directory ""]
    Indexing "example.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... Skipping - "unzip -p '/www/example/products/example.docx' word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
    Indexing "html.htm" as text/html ... added
    Indexing "sample.doc" as application/msword ... Skipping - "antiword -mUTF-8.txt '/www/example/products/sample.doc'" failed
    Indexing "text.txt" as text/plain ... added

I didn't install "unzip" or "antiword", so that's what I'd expect to happen.

> This next bit is me wondering about the output I've gotten.
> 
> John at win-7-test ~/xapian-omega-1.4.0
> $ ls -al ../../../www/example/msproducts/
> total 357
> drwx------+ 1 John None      0 Sep 27 16:25 .
> drwx------+ 1 John None      0 Sep 27 15:41 ..
> -rwx------+ 1 John None  32476 Sep 14 15:18 100-objects-v1.csv
> -rwx------+ 1 John None  32477 Sep 14 15:19 100-objects-v2.csv
> -rwx------+ 1 John None  14228 Aug 31 11:41 burger.docx
> -rwx------+ 1 John None  19034 Jun 30 12:15 hotdog.docx
> -rwx------+ 1 John None  10538 Sep 14 15:30 index.html
> -rwx------+ 1 John None 137728 Jun 30 12:15 sausage.doc
> -rwx------+ 1 John None  71536 Sep 14 15:21 states.csv
> -rwx------+ 1 John None    541 Sep 14 15:21 us_states_on_wikipedia.html
> -rwx------+ 1 John None  29824 Aug 31 15:08 zlib_how.html
> 
> John at win-7-test ~/xapian-omega-1.4.0
> $ ./omindex -v --db omtest.db --url msproducts
> ../../../www/example/msproducts/

Hmm, I notice here you have "www/example/msproducts", but above you said
"/www/example/product" - "msproducts" vs "product".  Could that be the
problem, or was the earlier one just a typo or hypothetical example?
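
One way to rule that out would be to check, from the same bash shell you run
omindex in, which of the two spellings actually exists as a directory.  A rough
sketch (check_dir is just a name I've made up for illustration, and the two
paths are the ones quoted in this thread):

```shell
# Print whether a path is a directory as seen from the current shell.
# check_dir is a hypothetical helper, not part of omega or cygwin.
check_dir() {
    if [ -d "$1" ]; then
        echo "$1: directory exists"
    else
        echo "$1: not a directory"
    fi
}

# The two spellings that appear in this thread.
check_dir /www/example/product
check_dir /www/example/msproducts
```

Whichever one reports "directory exists" is the absolute path to pass to
omindex.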

> John at win-7-test ~/xapian-omega-1.4.0
> $ [Entering directory ""]

What was the exact command line you used to run the indexer here?  It seems to
have got lost from the paste, and would be useful to know.

> Indexing "100-objects-v1.csv" as text/csv ... added
> Indexing "100-objects-v2.csv" as text/csv ... added
> Indexing "burger.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... The system cannot find the path specified.
> Skipping - "unzip -p "..\..\..\www\example\msproducts\burger.docx" word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
> Indexing "hotdog.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... The system cannot find the path specified.
> Skipping - "unzip -p "..\..\..\www\example\msproducts\hotdog.docx" word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
> Indexing "index.html" as text/html ... added
> 
> omindex stops when it hits sausage.doc, and echo $? returns 0, so I've no
> idea why it doesn't want to process an ms word .doc file, although I
> suspect it may be related to the inability to process the .docx files. I
> should note that I am performing this work on a windows VM that does not have
> MS office or open office installed, if that makes a difference.

That shouldn't matter.  By default omindex will try to use unzip (and internal
XML parsing) for .docx and antiword for .doc.
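
If it's useful, a quick way to see whether those two helpers are visible from
the shell you run omindex in is something like the following.  This is only a
sketch: have_filter is a name I've invented here, not something omindex
provides, and it only checks that the command is on PATH, not that it works.

```shell
# Report whether an external helper command is on PATH in the current shell.
# have_filter is a hypothetical helper name, not part of omindex.
have_filter() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# The two helpers mentioned above for .docx and .doc files.
for tool in unzip antiword; do
    have_filter "$tool"
done
```

A "missing" line for either tool would be consistent with the corresponding
"Skipping - ... failed" messages.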

I can't see why it shouldn't try to handle the other files in the directory
- in my test it continues after both the .docx and .doc failures.

Cheers,
    Olly