[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Olly Betts olly at survex.com
Sun Feb 26 13:12:08 GMT 2006


On Sun, Feb 26, 2006 at 11:47:31AM +0000, James Aylett wrote:
> On Sun, Feb 26, 2006 at 12:57:51AM +0000, Olly Betts wrote:
> > * allowing more control over what QueryParser treats as a word character
> >   (and tweak the defaults to avoid generating phrase searches in cases
> >   where we don't need to - for example: 2.4.1 is currently a 3 term
> >   phrase query, and a slow case).
> 
> In this case, do we want to generate (nopos?) terms for 2.4, 2? And
> maybe other subparts? (As 2.4.1 is actually hierarchical I think
> having 2.4 and 2 would be sufficient.)

I used to think we probably should, but I'm less convinced at present.
I suspect it might do more harm than good overall.
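
To make the suggestion concrete: the hierarchical subparts of 2.4.1 are
just its dotted prefixes.  A minimal Python sketch follows; the function
name dotted_prefixes is hypothetical, and this is not something
QueryParser does.

```python
def dotted_prefixes(token):
    """Return the proper dotted prefixes of a version-like token, longest first."""
    parts = token.split(".")
    # For "2.4.1" this joins the first 2 parts, then the first 1 part.
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


print(dotted_prefixes("2.4.1"))  # ['2.4', '2']
```

Every version-like token would then add one extra term per dot, which
is part of why it could do more harm than good.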

> > * fix the $highlight command in Omega to handle utf-8 and the
> >   configurable definitions of what a word is.
> 
> Something that has occurred to me recently is a combined summarise and
> highlight, so we get an effect closer to what Google does. (So if you
> stuff the entire content into an appropriate data field you can have
> unmatched bits of it elided at display time.)

Yes, that would be nice.

> Richard (or possibly me) at one point wanted configurable highlighting
> that picked each word-that-matched-a-term out in a different
> colour. We came up with a somewhat neat way of doing this (from pov of
> output sanity, rather than coding simplicity, although it wouldn't be
> terribly difficult) when we were looking at a better opensearch over
> atom.

That's what $highlight does by default since 0.9.2 (I implemented it for
gmane).  The colours repeat if the query has more than 10 terms
currently, but it would be easy to extend the list if that's a problem
for anyone.
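
The repeating behaviour amounts to indexing a fixed list of colours
modulo its length.  A rough Python sketch of the idea, not Omega's
actual code; the palette entries and the name colour_for_term are
placeholders invented for illustration.

```python
def colour_for_term(term_index, palette):
    """Pick a highlight colour for the term at term_index, wrapping around."""
    return palette[term_index % len(palette)]


# Hypothetical 10-entry palette; these are placeholder names, not the
# colours Omega really uses.
PALETTE = ["colour%d" % i for i in range(10)]

# The 11th query term (index 10) wraps round to the first colour.
assert colour_for_term(10, PALETTE) == colour_for_term(0, PALETTE)
```

Extending the list simply moves the point at which colours start to
repeat.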

> > Before you ask, I don't have a date for 1.0 yet.  I suspect we'll want
> > at least one more 0.9.X first, to collect up any bug fixes, especially
> > since upgrading to 1.0 will be a bigger deal than usual, because it will
> > require a reindex for many users.
> 
> I'm turning out to have small amounts of time for Xapian at the moment
> - I'm currently working on a lightweight indexer for email (a bit
> like woodpecker or mbox2omega, but better :-). I've got enough for
> that to be useful to me, though, so if there are things I can do to
> either core or omega let me know.

A fairly self-contained task is to look at each external filter
program we use (for indexing PDFs, Microsoft Word documents, etc) and
check what encoding it outputs.  For example, "man pdftotext" reveals
that there's a "-enc" option to set the encoding, and it defaults to
"Latin1".  A bit more digging reveals that "UTF-8" is a built-in
encoding, so calling "pdftotext -enc UTF-8" is what is needed.  I
suspect it won't be so easy in general though, and some filters will
require use of "iconv" or similar.
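
As a rough illustration of the two cases, here is a Python sketch.  The
function names are made up for this example; the pdftotext call uses
the "-enc UTF-8" option mentioned above and assumes that "-" as the
output file name sends the text to stdout.  The second function stands
in for what "iconv -f ENCODING -t UTF-8" would do for a filter that
has no such option.

```python
import subprocess


def pdf_to_utf8_text(pdf_path):
    """Extract text from a PDF, asking pdftotext for UTF-8 output."""
    # Assumption: "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True, stdout=subprocess.PIPE)
    return result.stdout.decode("utf-8")


def recode_to_utf8(raw, source_encoding):
    """Convert raw filter output from source_encoding to UTF-8 bytes."""
    return raw.decode(source_encoding).encode("utf-8")


# Latin-1 "café" (e-acute is the single byte 0xE9) becomes two bytes in UTF-8.
print(recode_to_utf8(b"caf\xe9", "latin-1"))  # b'caf\xc3\xa9'
```

The awkward part is knowing the right source encoding for each filter,
which is what needs checking filter by filter.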

> What did strike me as useful would be a better approach to the
> document data. Currently we can't really put blobs into the field
> values, and (for instance) in email you probably want to preserve
> newlines in the summary. I keep on almost sitting down and
> implementing XML support (so if the document data starts "<?xml" it's
> parsed out as XML), at which point I guess we want a $xpath command in
> omegascript to pull out the equivalent of fields.
>
> I'm wary about introducing a dependency on libxml2 though - is there
> a lighter-weight format we could use? rdf/n3 perhaps?

XML seems overkill if the benefit is preserving blobs and newlines -
just adding support for some sort of escape sequences would be much
easier and less overhead, and XML itself ends up introducing escape
sequences anyway (for "<", ">", etc).
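
A sketch of what such escaping could look like, in Python.  The
backslash scheme and both function names are assumptions made for the
example, not an existing Omega feature.

```python
def escape_value(value):
    """Escape backslashes and newlines so a value fits on a single line."""
    # Backslashes must be doubled first, or the "\n" escapes would be doubled too.
    return value.replace("\\", "\\\\").replace("\n", "\\n")


def unescape_value(escaped):
    """Reverse escape_value."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1]
            # "\n" becomes a newline; any other escaped character is taken literally.
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


summary = "First line\nSecond line with a \\ in it"
assert "\n" not in escape_value(summary)
assert unescape_value(escape_value(summary)) == summary
```

The escaped value never contains a raw newline, so it can still be
stored one field per line.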

There are other benefits from XML for some users of course, but it's a
bit on the verbose side.  Even just the five bytes of "<?xml" at the
start of each record add up when you repeat them a hundred million
times (~0.5GB), and although we'll be compressing records soon, it's
harder to compress out redundancy between records.

0.9.3 allows you to avoid storing the fieldnames for every record.  If
you set $opt{fieldnames} to a list of the fieldnames, then you can just
store the values one-per-line in the document data in the same order.
If you want to go the XML route, it would probably be better to omit
at least the outer wrapper and require an option to be set to indicate
the document data should be parsed as XML.
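
To illustrate the fieldnames approach, here is a Python sketch of
reading both layouts.  It assumes the usual document data layout of one
"name=value" pair per line; the function name and the example field
names are only for illustration.

```python
def parse_document_data(data, fieldnames=None):
    """Parse document data into a dict of field values.

    Without fieldnames, each line is expected to be "name=value".
    With fieldnames, line N holds the value of the Nth field name.
    """
    lines = data.split("\n")
    if fieldnames is not None:
        return dict(zip(fieldnames, lines))
    fields = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if sep:  # skip lines with no "=" in them
            fields[name] = value
    return fields


with_names = "url=/docs/intro.html\ncaption=Introduction"
values_only = "/docs/intro.html\nIntroduction"

# Both layouts give the same fields; the second stores no names per record.
assert (parse_document_data(with_names)
        == parse_document_data(values_only, ["url", "caption"]))
```

The saving is the length of each name plus the "=" for every field of
every record.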

Cheers,
    Olly


