[Xapian-discuss] Not separating words when parsing HTML in Omega
Crowell, Brian
BCrowell at barbnet.com
Wed Feb 9 21:11:18 GMT 2011
We noticed, when indexing a Word 2007 document, that two words in
adjacent paragraphs got melded together in the Xapian database. For
example:
    To find the document containing

    these two paragraphs...
...you would search for "containingthese".
I fixed it locally by adding the line dump.append(" "); just before the
return in process_text() in myhtmlparse.cc. Thought I'd mention it to
see if anyone could put in a better/more permanent fix.
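
For illustration, here is a minimal, self-contained sketch of that workaround. It is not Omega's actual code: TextCollector is a hypothetical stand-in for the parser class, and the signature of process_text() shown here is an assumption. Only the name dump and the dump.append(" ") call come from the description above.

```cpp
#include <string>

// Hypothetical stand-in for the HTML parser; not Omega's real class.
struct TextCollector {
    // Accumulated plain text that would later be handed to the indexer.
    std::string dump;

    // Assumed to be called once for each run of text found between tags.
    void process_text(const std::string& text) {
        if (text.empty()) return;
        dump.append(text);
        // The local workaround: end every run with a space so that the
        // next run (for example, the next paragraph) starts a new word.
        dump.append(" ");
    }
};
```

Feeding this sketch the two paragraphs from the example leaves "containing these" in dump rather than "containingthese". The trade-off is that a space is added after every text run, so in this sketch a word split by inline markup (such as a single bold letter) would also be broken in two, which is probably why a better fix is worth looking for.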
I could send a sample document that produces the error, if that helps.
--Brian Crowell
Developer, Barbnet Investments