[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

James Aylett james-xapian at tartarus.org
Mon Feb 27 09:57:27 GMT 2006

On Mon, Feb 27, 2006 at 12:12:22AM +0000, Olly Betts wrote:

[decimal strings]
> It's certainly possible to come up with cases where it'd help (e.g.
> searching for: `linux 2.6 vm performance' would match documents
> which only mention an actual version, like 2.6.5).  But actually
> a document which doesn't mention 2.6 explicitly isn't likely to
> be a good match for that sort of query.
> Another instance of such "decimal strings" is section numbers in a
> document - for example, discussions referring to language standards
> will often refer to a particular section as say `6.4.2'.  Here
> 6.4 and 6 aren't interesting terms at all really.

Yes, I think you've convinced me it isn't worth it. At least not until
someone comes up with a case that needs it :-)

> <b class="omegahl1"> ... </b>

That's exactly the sort of thing I was thinking of. If you could $set
an option to say how many of them you'd created, that'd be fantastic.
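
To make the request concrete, here is a rough sketch of how a configurable
number of highlight classes might be cycled across query terms. It is not
Omega's implementation: the function name highlight and the num_classes
parameter are hypothetical, and only the omegahlN class naming comes from the
quoted example above.

```python
import html
import re
from typing import List


def highlight(text: str, terms: List[str], num_classes: int) -> str:
    """Wrap each query term in <b class="omegahlN">, cycling N over 1..num_classes.

    Hypothetical helper: num_classes stands in for the option the stylesheet
    author would set to say how many omegahlN classes they have defined.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    # The i-th term (0-based) gets class number (i mod num_classes) + 1.
    class_for = {
        term.lower(): "omegahl%d" % (i % num_classes + 1)
        for i, term in enumerate(terms)
    }
    if not class_for:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in class_for) + r")\b",
        re.IGNORECASE,
    )
    out = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the unhighlighted text between matches.
        out.append(html.escape(text[last:match.start()]))
        cls = class_for[match.group(0).lower()]
        out.append('<b class="%s">%s</b>' % (cls, html.escape(match.group(0))))
        last = match.end()
    out.append(html.escape(text[last:]))
    return "".join(out)
```

With two classes defined and three query terms, the third term would wrap
around and reuse omegahl1.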

> > > some filters will require use of "iconv" or similar.
> > I thought we'd agreed on ILU for our charset conversions? (Is it ILU
> > or ICU? Or some other TLA :-)
> ICU comes under "similar"!

Yes, I'm not arguing with that! I meant more that I remembered
discussions where I thought we'd agreed on that, so if I do dive off
and do anything on that I use the right library.

> > That's neat. I wonder if we could do something that combined with that
> > to allow blobs with very little structural overhead. It would be nice
> > to (a) keep the lengthing intrinsic to the doc data, and (b) enable
> > intrinsic field names for people who want to work that way.
> Erm, what's a "lengthing intrinsic"?

It isn't anything. I want to keep the lengthing data (how long the
fields are) intrinsic to the Document data.

> Are you suggesting that we should store the length of each field value
> before the field value instead of having one per line?  Sort of like
> pascal strings rather than C strings.
> That'd work fairly well if we used the variable length encoding for
> integers, so field lengths 0-127 would only need a byte and it would
> be only a little less compact than using a newline terminated value.

I'd be quite happy with that.
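
A minimal sketch of the length-prefixed layout discussed above, assuming a
common 7-bits-per-byte variable-length integer (the high bit marks "more bytes
follow"). This is not necessarily the encoding Xapian uses internally, and the
function names are made up for illustration.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # values 0-127 fit in this single byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position of the first byte after the integer)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_fields(values: List[bytes]) -> bytes:
    """Store each field as <length><bytes>, like a Pascal string."""
    return b"".join(encode_varint(len(v)) + v for v in values)


def unpack_fields(blob: bytes) -> List[bytes]:
    """Inverse of pack_fields."""
    fields = []
    pos = 0
    while pos < len(blob):
        length, pos = decode_varint(blob, pos)
        fields.append(blob[pos:pos + length])
        pos += length
    return fields
```

One consequence of prefixing lengths is that a field value may itself contain
newlines, which a one-value-per-line layout cannot represent without escaping.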

> > > If you want to go the XML route, it would probably be better to omit
> > > at least the outer wrapper and require an option to be set to indicate
> > > the document data should be parsed as XML.
> > 
> > An option where?
> I meant an OmegaScript option, i.e. one set by $set and read using $opt.
> So something like $set{xmlfields,1} would tell Omega to expect XML fields.

I don't like this. OmegaScript shouldn't tell omega how to use the
database, it should only tell omega how to map the query to the
database. I think that requiring an OmegaScript option, which is
extrinsic to the database, is too much coupling between omega and
xapian. If we had some way of doing it intrinsically then it wouldn't
feel so omega-specific.


  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org
