[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Olly Betts olly at survex.com
Mon Feb 27 00:12:22 GMT 2006


On Sun, Feb 26, 2006 at 10:46:19PM +0000, James Aylett wrote:
> On Sun, Feb 26, 2006 at 01:12:08PM +0000, Olly Betts wrote:
> 
> > > In this case, do we want to generate (nopos?) terms for 2.4, 2? And
> > > maybe other subparts? (As 2.4.1 is actually hierarchical I think
> > > having 2.4 and 2 would be sufficient.)
> > 
> > I used to think we probably should, but I'm less convinced at present.
> > I suspect it might do more harm than good overall.
> 
> Possibly. Hmm.

It's certainly possible to come up with cases where it'd help (e.g.
searching for `linux 2.6 vm performance' would match documents
which only mention an actual version like 2.6.5).  But a document
which doesn't mention 2.6 explicitly isn't likely to be a good match
for that sort of query.

Another instance of such "decimal strings" is section numbers in a
document - for example, discussions of language standards will often
refer to a particular section as, say, `6.4.2'.  Here 6.4 and 6 aren't
really interesting terms at all.

> > That's what $highlight does by default since 0.9.2 (I implemented it for
> > gmane).  The colours repeat if the query has more than 10 terms
> > currently, but it would be easy to extend the list if that's a problem
> > for anyone.
> 
> How do you control the colours? We were going to have a mechanism
> which allowed you to use CSS classes, and specify how many you wanted.

It just has 10 hex values hardcoded (the same 10 which the old Gmane
search had, which seem to be similar to what Google uses).  It wouldn't
be hard to use CSS classes - instead of:

<b style="color:black;background-color:#ffff66"> ... </b>

Produce:

<b class="omegahl1"> ... </b>

If we use something like <b> ... </b> rather than <span> ... </span>
then we get graceful degradation if there's no CSS support, or if the
omegahl<n> classes aren't defined in the stylesheet.
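
Sketching it in C++ (just an illustration - the function name and the
exact class naming scheme here are made up, not Omega's actual code),
the class-based version would be something like:

#include <string>
#include <sstream>

// Wrap a matching word in a <b> tag whose CSS class is chosen by
// cycling through N_CLASSES classes (omegahl1 ... omegahlN), instead
// of hardcoding style="color:...;background-color:#...".
const unsigned N_CLASSES = 10;

std::string
highlight_word(const std::string & word, unsigned term_index)
{
    std::ostringstream out;
    // Cycle, so queries with more than N_CLASSES terms just reuse
    // classes, like the hardcoded hex values do now.
    unsigned n = term_index % N_CLASSES + 1;
    out << "<b class=\"omegahl" << n << "\">" << word << "</b>";
    return out.str();
}

The stylesheet is then free to define (or not define) omegahl1 to
omegahl10 however it likes.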

> > A fairly self-contained task is to look at each external filter
> > program we use (for indexing PDFs, Microsoft Word documents, etc)
> > and check what encoding it outputs [...] some filters will require
> > use of "iconv" or similar.
> 
> I thought we'd agreed on ILU for our charset conversions? (Is it ILU
> or ICU? Or some other TLA :-)

ICU comes under "similar"!
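
For the filters which don't already emit UTF-8, the conversion itself
is fairly mechanical - roughly along these lines with iconv(3) (only a
sketch: error handling is minimal, and the source charset would have
to come from checking what each filter actually outputs):

#include <cerrno>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `in' from a known charset (whatever a particular external
// filter is documented to emit) into UTF-8.
std::string
convert_to_utf8(const std::string & in, const char * from_charset)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[1024];
    char * inptr = const_cast<char *>(in.data());
    size_t inleft = in.size();
    while (inleft) {
        char * outptr = buf;
        size_t outleft = sizeof(buf);
        // E2BIG just means the output buffer filled up; anything else
        // (EILSEQ, EINVAL) is a real conversion problem.
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1
            && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
        out.append(buf, sizeof(buf) - outleft);
    }
    iconv_close(cd);
    return out;
}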

> > 0.9.3 allows you to avoid storing the fieldnames for every record.  If
> > you set $opt{fieldnames} to a list of the fieldnames, then you can just
> > store the values one-per-line in the document data in the same order.
> 
> That's neat. I wonder if we could do something that combined with that
> to allow blobs with very little structural overhead. It would be nice
> to (a) keep the lengthing intrinsic to the doc data, and (b) enable
> intrinsic field names for people who want to work that way.

Erm, what's a "lengthing intrinsic"?

Are you suggesting that we should store the length of each field value
before the field value instead of having one per line?  Sort of like
Pascal strings rather than C strings.

That'd work fairly well if we used the variable length encoding for
integers: field lengths of 0-127 would only need a single byte, so it
would be only a little less compact than using a newline-terminated
value.
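
For concreteness, here's the sort of thing I mean (a sketch; the
helper names are invented, and the 7-bits-per-byte scheme is just the
usual way of doing variable length integers):

#include <string>

// Append `value' as a variable length integer: 7 bits per byte, high
// bit set on all but the last byte, so lengths 0-127 take one byte.
void
append_varint(std::string & out, size_t value)
{
    while (value >= 0x80) {
        out += char((value & 0x7f) | 0x80);
        value >>= 7;
    }
    out += char(value);
}

// Append one field: its length as a varint, then the raw bytes.
void
append_field(std::string & out, const std::string & field)
{
    append_varint(out, field.size());
    out += field;
}

// Decoding walks the data the same way: read the varint length
// starting at `pos', then take that many bytes as the field value.
bool
read_field(const std::string & data, size_t & pos, std::string & field)
{
    size_t len = 0;
    int shift = 0;
    while (pos < data.size()) {
        unsigned char ch = data[pos++];
        len |= size_t(ch & 0x7f) << shift;
        if (!(ch & 0x80)) {
            if (pos + len > data.size()) return false;
            field.assign(data, pos, len);
            pos += len;
            return true;
        }
        shift += 7;
    }
    return false;
}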

> > If you want to go the XML route, it would probably be better to omit
> > at least the outer wrapper and require an option to be set to indicate
> > the document data should be parsed as XML.
> 
> An option where?

I meant an OmegaScript option, i.e. one set by $set and read using $opt.
So something like $set{xmlfields,1} would tell Omega to expect XML fields.

Cheers,
    Olly


