[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sun Feb 26 11:47:31 GMT 2006

On Sun, Feb 26, 2006 at 12:57:51AM +0000, Olly Betts wrote:

> Omega's scriptindex indexer is a good fit for what you want to do
> [...] currently iso-8859-1 input is assumed by the word splitting
> though.

Note that I've got the word splitter from omega (more or less) in
python. Anyone wanting to index UTF-8 would find that an easier base
than trying to do it in PHP, I imagine, because python has a UTF-8
capable core (PHP probably won't until 6.0).
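
As a rough illustration of why Python is a comfortable base here (this is
not the omega-derived splitter mentioned above, just a minimal sketch
assuming Python 3 and a hypothetical helper named split_words), decoding
the input first lets the regex engine treat non-ASCII letters as word
characters:

```python
import re

# Hypothetical helper, not the omega word splitter. On str patterns,
# \w matches Unicode letters and digits (and the underscore).
_WORD_RE = re.compile(r"\w+")


def split_words(raw_bytes, encoding="utf-8"):
    """Decode raw_bytes and return the lower-cased word tokens in order."""
    text = raw_bytes.decode(encoding)
    return [m.group(0).lower() for m in _WORD_RE.finditer(text)]
```

A real indexer would probably also want stemming and some handling of
apostrophes and similar punctuation, which this sketch ignores.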

Obviously, as Olly says, it would be preferable to have xapian ship
with a reasonable UTF-8 index assistant.

> * allowing more control over what QueryParser treats as a word character
>   (and tweak the defaults to avoid generating phrase searches in cases
>   where we don't need to - for example: 2.4.1 is currently a 3 term
>   phrase query, and a slow case).

In this case, do we want to generate (nopos?) terms for 2.4, 2? And
maybe other subparts? (As 2.4.1 is actually hierarchical I think
having 2.4 and 2 would be sufficient.)
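
To make the hierarchical idea concrete, here is a minimal sketch (the
function name hierarchical_terms is made up for illustration) that expands
a dotted token into its leading prefixes plus the full token. An indexer
could then add these without positional information, for instance with
Document.add_term() rather than add_posting():

```python
def hierarchical_terms(token, sep="."):
    """Return every leading prefix of a dotted token, shortest first.

    Hypothetical helper: the last element is the full token itself.
    """
    parts = token.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]
```

For "2.4.1" this gives the terms 2, 2.4 and 2.4.1, so a query for any
level of the hierarchy can be a single term rather than a phrase.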

> * fix the $highlight command in Omega to handle utf-8 and the
>   configurable definitions of what a word is.

Something that has occurred to me recently is a combined summarise and
highlight, so we get an effect closer to what Google does. (So if you
stuff the entire content into an appropriate data field you can have
unmatched bits of it elided at display time.)
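
A minimal sketch of that combined behaviour, under several assumptions:
words are compared to the query terms exactly (no stemming), the output is
not HTML-escaped, and the function name and parameters are hypothetical
rather than anything Omega provides:

```python
import re


def summarise(text, terms, context=3, mark=("<b>", "</b>"), gap="..."):
    """Keep `context` words either side of each match and elide the rest."""
    words = text.split()
    wanted = {t.lower() for t in terms}

    def norm(word):
        # Strip punctuation so "term," still matches "term".
        return re.sub(r"\W+", "", word).lower()

    hits = {i for i, w in enumerate(words) if norm(w) in wanted}
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - context), min(len(words), i + context + 1)))

    out = []
    prev = -1
    for i in sorted(keep):
        if i != prev + 1:
            out.append(gap)  # something was skipped before this word
        out.append(mark[0] + words[i] + mark[1] if i in hits else words[i])
        prev = i
    if keep and prev != len(words) - 1:
        out.append(gap)  # trailing text was dropped
    return " ".join(out)
```

Overlapping windows merge naturally because the kept positions are held
in a set, so two nearby matches produce one continuous fragment.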

Richard (or possibly me) at one point wanted configurable highlighting
that picked each word-that-matched-a-term out in a different
colour. We came up with a somewhat neat way of doing this (from pov of
output sanity, rather than coding simplicity, although it wouldn't be
terribly difficult) when we were looking at a better opensearch over
atom.

Neither of them is terribly important IMHO though.

> Before you ask, I don't have a date for 1.0 yet.  I suspect we'll want
> at least one more 0.9.X first, to collect up any bug fixes, especially
> since upgrading to 1.0 will be a bigger deal than usual, because it will
> require a reindex for many users.

I'm turning out to have small amounts of time for Xapian at the moment
- I'm currently working on a lightweight indexer for email (a bit
like woodpecker or mbox2omega, but better :-). I've got enough for
that to be useful to me, though, so if there are things I can do to
either core or omega let me know.

What did strike me as useful would be a better approach to the
document data. Currently we can't really put blobs into the field
values, and (for instance) in email you probably want to preserve
newlines in the summary. I keep on almost sitting down and
implementing XML support (so if the document data starts "<?xml" it's
parsed out as XML), at which point I guess we want a $xpath command in
omegascript to pull out the equivalent of fields.
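
The dispatch described above could be sketched like this. It uses
Python's standard xml.etree.ElementTree instead of libxml2, which only
supports a limited XPath subset, and it assumes the non-XML case is the
usual one-line-per-field name=value layout. The function name get_field is
hypothetical:

```python
import xml.etree.ElementTree as ET


def get_field(data, name):
    """Return the named field from document data, or None if absent."""
    if data.startswith("<?xml"):
        root = ET.fromstring(data)
        node = root.find(name)  # `name` may be a simple path such as "a/b"
        return node.text if node is not None else None
    # Fallback: plain name=value lines, split on the first "=" only.
    for line in data.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == name:
            return value
    return None
```

With the XML form a summary element can keep its embedded newlines, which
the line-based form cannot represent.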

I'm wary about introducing a dependency on libxml2 though - is there
a lighter-weight format we could use? rdf/n3 perhaps?

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org