[Xapian-discuss] Not separating words when parsing HTML in Omega

Crowell, Brian BCrowell at barbnet.com
Wed Feb 9 21:11:18 GMT 2011


We noticed, when indexing a Word 2007 document, that the last word of
one paragraph and the first word of the next were run together into a
single term in the Xapian database. For example:

  To find the document containing

  these two paragraphs...

...you would search for "containingthese".

I fixed it locally by adding the line

  dump.append(" ");

just before the return in process_text() in myhtmlparse.cc. Thought I'd
mention it to see if anyone could put in a better/more permanent fix.
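
To illustrate the workaround, here is a minimal sketch. It assumes a
simplified text accumulator: the TextCollector struct and the body of
process_text() below are hypothetical stand-ins, not the actual code in
myhtmlparse.cc. Only the dump member and the appended space come from
the change described above.

```cpp
#include <string>

// Hypothetical stand-in for the parser's text accumulator.
// This is NOT Omega's real class; it only mirrors the names used above.
struct TextCollector {
    std::string dump;  // accumulated text that later gets indexed

    // Assumed to be called once per run of text found between tags.
    void process_text(const std::string& text) {
        dump.append(text);
        // The local workaround: always add a separator after each run,
        // so the end of one paragraph cannot fuse with the start of the
        // next ("containing" + "these" no longer gives "containingthese").
        dump.append(" ");
    }
};
```

With this sketch, feeding in "To find the document containing" and then
"these two paragraphs..." leaves a space between "containing" and
"these". It also leaves a trailing space at the end of dump, and may
split text that was meant to be contiguous (for example a word broken
up by inline markup), which is part of why a more careful fix would be
welcome.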

I could send a sample document that produces the error, if that helps.

--Brian Crowell
  Developer, Barbnet Investments


