[Xapian-discuss] UTF8 support plans (without stemming)

Thu Apr 28 13:09:42 BST 2005

On Thu, Apr 28, 2005 at 11:08:28AM +0400, Alexandre wrote:

> To be honest, I didn't dig inside the library; I just believed the
> bug report... =) Anyway, usually, when an application/library was
> developed to support only one language (American English), it's very
> hard to make it work with other languages (for example, Russian) -
> there are lots of problems inside...

To clarify: there is nothing in the main parts of Xapian (indexing and
query) that presumes anything about what you're stuffing in
there. There may (I can't remember) be some practical issues about
putting NUL bytes in there (so if you want pure binary terms, you may
have to jump through some hoops), but UTF-8 doesn't cause problems.
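
As a quick illustration of why UTF-8 sidesteps the NUL question: UTF-8
only ever produces a zero byte for the character U+0000 itself, so
ordinary text never contains one. A minimal check in plain Python (the
helper name has_nul_byte is made up for this example, and nothing here
touches Xapian):

```python
def has_nul_byte(text):
    """Return True if the UTF-8 encoding of text contains a NUL (0x00) byte."""
    return b"\x00" in text.encode("utf-8")


# Ordinary text, in any script, encodes without NUL bytes.
print(has_nul_byte(u"hello"))   # False
print(has_nul_byte(u"привет"))  # False
# Only an explicit U+0000 character yields one.
print(has_nul_byte(u"a\x00b"))  # True
```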

You could, for instance, use UTF-8 from the Python bindings (as a
simple example) by doing something like:

----------------------------------------------------------------------
import xapian

doc = xapian.Document()
doc.add_term(u"hi".encode('utf-8'))  # encode the Unicode string to UTF-8 first
----------------------------------------------------------------------

You have to do the encode('utf-8') bit because the Python bindings
take a Python string, rather than a Unicode string. This could be
wrapped in a convenience function, or for that matter could be
supported directly in the bindings if necessary.
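
A minimal sketch of what such a convenience function might look like,
assuming only that the document object has the add_term() method used
above. The name add_utf8_term is made up for illustration; it isn't
part of the bindings.

```python
def add_utf8_term(doc, term):
    """Encode a text term as UTF-8 and add it to doc.

    Hypothetical helper, not part of the Xapian bindings. doc is assumed
    to be any object with an add_term() method, such as a xapian.Document.
    """
    # Leave byte strings alone; encode anything else as UTF-8.
    if not isinstance(term, bytes):
        term = term.encode("utf-8")
    doc.add_term(term)
    return term
```

With the document from the example above, you'd call
add_utf8_term(doc, u"hi") and not have to think about the encoding at
each call site.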

The two bits of the Xapian core that don't currently support UTF-8 are
the stemmers and the QueryParser (and the QueryParser relies on the
stemmers). The stemmers are based on Snowball, and I think there are
plans, not yet carried out, to make Snowball UTF-8 capable.

The stemmers are in the library because a lot of people need
them. However, you can do probabilistic IR /without/ stemming, which
some people recommend anyway.
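
To make that concrete, here's a rough sketch of turning text into
unstemmed UTF-8 terms using nothing but the Python standard library.
Lowercasing and splitting on non-word characters are my assumptions
about a sensible minimum, and unstemmed_terms is a made-up name; each
resulting term could then be passed to add_term().

```python
import re

# \w+ with re.UNICODE matches runs of letters, digits and underscores
# in any script, not just ASCII.
_WORD = re.compile(r"\w+", re.UNICODE)


def unstemmed_terms(text):
    """Split text into lowercased words, encoded as UTF-8, with no stemming."""
    return [word.lower().encode("utf-8") for word in _WORD.findall(text)]


print(unstemmed_terms(u"Hello, world"))  # [b'hello', b'world'] on Python 3
```

The trade-off is that "search" and "searching" end up as separate
terms, so a query for one won't match the other.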

> I'm not an expert, so I have no moral right to say that I strongly
> believe that 'probabilistic IR' is kind of outdated.
> I just suppose that a computer can work well with lots of data, while
> the human brain can make some sorts of decisions. No, I'm not for
> boolean search, but I just don't like the probabilistic approach too
> much (when the machine tries to be smart)... I may be (and probably
> am) absolutely wrong; that's why I'm interested in why people choose
> such an approach.

Xapian is probabilistic because of its history [1]. Also, people are
using probabilistic IR quite happily in the real world.

You seem to suggest that you'd be happier moving the decision system
(such as ranking) out of the computer and into people. The problem I
have with that is that if you're searching for something very common,
it's very difficult for people to deal with that much data. There are
mixed approaches, some using probabilistic IR, that allow the user to
guide the ranking and grouping processes, but they tend to be harder
to use for most people (at least without training). I suspect that the
first person who comes up with a really good interface to that sort of
thing will be able to change the way we deal with information. (It'll
probably be Google though, on the basis of the number of HCI and IR
people they have :-)

Others may have different opinions. :)

[1] http://xapian.org/history.php

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org