[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sun Feb 26 22:46:19 GMT 2006

On Sun, Feb 26, 2006 at 01:12:08PM +0000, Olly Betts wrote:

> > In this case, do we want to generate (nopos?) terms for 2.4, 2? And
> > maybe other subparts? (As 2.4.1 is actually hierarchical I think
> > having 2.4 and 2 would be sufficient.)
> 
> I used to think we probably should, but I'm less convinced at present.
> I suspect it might do more harm than good overall.

Possibly. Hmm.
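
To make the question concrete, here is a minimal sketch of what generating the extra terms for a hierarchical token like 2.4.1 might look like. The function name `hierarchical_prefixes` is hypothetical and not part of Xapian; it only computes the candidate terms (2.4 and 2) and says nothing about how they would be added to a document.

```python
def hierarchical_prefixes(term, sep="."):
    """Return the proper prefixes of a dotted term, longest first.

    Hypothetical helper for illustration only; not a Xapian API.
    """
    parts = term.split(sep)
    # For "2.4.1" this yields "2.4" and then "2"; the full term is excluded.
    return [sep.join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


if __name__ == "__main__":
    print(hierarchical_prefixes("2.4.1"))
```

A term with no separator gives an empty list, so ordinary words would be unaffected.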

> > Richard (or possibly me) at one point wanted configurable highlighting
> > that picked each word-that-matched-a-term out in a different
> > colour.
> 
> That's what $highlight does by default since 0.9.2 (I implemented it for
> gmane).  The colours repeat if the query has more than 10 terms
> currently, but it would be easy to extend the list if that's a problem
> for anyone.

How do you control the colours? We were going to have a mechanism
which allowed you to use CSS classes, and specify how many you wanted.
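
Roughly what I have in mind for the CSS approach, as a sketch only: each query term gets a class, and the classes repeat once the configured number runs out. The function `highlight`, the `hl` class prefix and the `n_classes` parameter are all made up for illustration; this is not how $highlight is implemented, and it matches whole words exactly with no stemming.

```python
import html
import re


def highlight(text, terms, n_classes=10, prefix="hl"):
    """Wrap each word matching a query term in a <span> with a per-term class.

    Hypothetical sketch: class names cycle after n_classes terms.
    """
    class_for = {
        term.lower(): f"{prefix}{i % n_classes}" for i, term in enumerate(terms)
    }
    out = []
    # re.split with a capturing group keeps both words and the text between them.
    for token in re.split(r"(\w+)", text):
        escaped = html.escape(token)
        cls = class_for.get(token.lower())
        if cls is not None:
            out.append(f'<span class="{cls}">{escaped}</span>')
        else:
            out.append(escaped)
    return "".join(out)


if __name__ == "__main__":
    print(highlight("Red fish, blue fish", ["fish", "blue"], n_classes=2))
```

The stylesheet then decides what hl0, hl1 and so on look like, so changing colours would not need a template change.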

> A fairly self-contained task is to look at each external filter
> program we use (for indexing PDFs, Microsoft Word documents, etc)
> and check what encoding it outputs [...] some filters will require
> use of "iconv" or similar.

I thought we'd agreed on ILU for our charset conversions? (Is it ILU
or ICU? Or some other TLA :-)

> > I'm wary about introducing a dependency on libxml2 though - is there
> > a lighter-weight format we could use? rdf/n3 perhaps?
> 
> XML seems overkill if the benefit is preserving blobs and newlines -
> just adding support for some sort of escape sequences would be much
> easier and less overhead, and XML itself ends up introducing escape
> sequences anyway (for "<", ">", etc).

I suspect <> are less common than \n, but & might come up enough for
it to shift the balance towards the middle. Certainly XML is overkill
if we only want to have blobs as field values.
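
For comparison, a minimal sketch of the escape-sequence alternative, assuming only two escapes are needed: backslash-n for a newline and a doubled backslash for a literal backslash. The function names are hypothetical and the choice of backslash as the escape character is an assumption, not something that has been agreed.

```python
def escape_value(value):
    """Encode a field value so that it fits on a single line."""
    # Escape backslashes first so the backslashes added for newlines
    # are not escaped a second time.
    return value.replace("\\", "\\\\").replace("\n", "\\n")


def unescape_value(escaped):
    """Reverse escape_value()."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, "")
        # "\n" becomes a newline; any other escaped character (including
        # a backslash) is taken literally. A lone trailing backslash is dropped.
        out.append("\n" if nxt == "n" else nxt)
    return "".join(out)


if __name__ == "__main__":
    original = "line one\nline two with a \\ and an & in it"
    encoded = escape_value(original)
    print(encoded)
    print(unescape_value(encoded) == original)
```

Note that &, < and > pass through untouched here, which is where this differs from XML.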

> There are other benefits from XML for some users of course, but it's a
> bit on the verbose side.  Even just the "<?xml" at the start of each
> record adds up when you repeat it a hundred million times (~0.5GB)
> and although we'll be compressing records soon, it's harder to compress
> out redundancy between records.

I'd be happy with a per-db field, but I was thinking personally about
something that didn't need me to poke around at that level. It could
be done extrinsically, but that seems unpleasant.

> 0.9.3 allows you to avoid storing the fieldnames for every record.  If
> you set $opt{fieldnames} to a list of the fieldnames, then you can just
> store the values one-per-line in the document data in the same order.

That's neat. I wonder if we could do something that combined with that
to allow blobs with very little structural overhead. It would be nice
to (a) keep the length information intrinsic to the doc data, and
(b) enable intrinsic field names for people who want to work that way.
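
As a sketch of how the two layouts could sit side by side: with a list of field names supplied, the document data is read as one value per line in that order; without one, each line is assumed to carry its own name in a name=value form. The function `parse_record` is hypothetical, and the name=value layout for the intrinsic case is an assumption made for the example.

```python
def parse_record(data, fieldnames=None):
    """Split document data into a dict of fields.

    Hypothetical sketch: positional values if fieldnames is given,
    otherwise one "name=value" pair per line (assumed layout).
    """
    lines = data.split("\n")
    if fieldnames is not None:
        # Values are stored one per line, in the same order as fieldnames.
        return dict(zip(fieldnames, lines))
    fields = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if sep:  # skip lines that carry no field name
            fields[name] = value
    return fields


if __name__ == "__main__":
    print(parse_record("A title\nhttp://example.org/", ["caption", "url"]))
    print(parse_record("caption=A title\nurl=http://example.org/"))
```

Neither layout copes with a newline inside a value, which is the gap the blob discussion is about.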

> If you want to go the XML route, it would probably be better to omit
> at least the outer wrapper and require an option to be set to indicate
> the document data should be parsed as XML.

An option where? We can reduce <?xml version="1.0"?> to two characters
by using a BOM, as I can't imagine anyone on earth wanting a field
name that starts with either [0xff] [0xfe] or vice versa (and they
aren't likely to start field values either). Better would probably be a
db-level option - do we have anything like this? ISTR some discussions
a while ago about versioning, but absent any better memory I kind of
assume this was at the binary level, ie quartz or flint storage
versioning rather than xapian versioning.
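
A minimal sketch of the BOM idea, assuming the two bytes are used purely as a marker at the start of the document data and that whatever follows would be handed to an XML parser. The function `split_marker` is hypothetical, and this ignores the question of what encoding the rest of the data is in.

```python
# The two byte orders of the BOM: [0xff] [0xfe] and [0xfe] [0xff].
BOMS = (b"\xff\xfe", b"\xfe\xff")


def split_marker(data):
    """Return (is_xml, payload) for a document data blob (bytes).

    Hypothetical sketch: a leading BOM marks the payload as XML
    and is stripped; anything else is returned unchanged.
    """
    if data[:2] in BOMS:
        return True, data[2:]
    return False, data


if __name__ == "__main__":
    print(split_marker(b"\xff\xfe<doc><title>x</title></doc>"))
    print(split_marker(b"caption=x"))
```

A db-level option would avoid even these two bytes per record.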

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org