[Xapian-discuss] Indexing specific data.(Help required)

Olly Betts olly at survex.com
Wed Feb 21 11:09:03 GMT 2007


On Wed, Feb 21, 2007 at 04:06:30PM +0530, Gupteshwar Joshi wrote:
> But the problem is that my data is in Devnagari script, and I use UTF-8
> encoding to support it.
> I apply a script called souindics, which extracts a sound code from each
> word and stores it in English letters.
> So I have to run that step first, and then use the xapian php-binding to
> index and search.
> But in this process my original document gets set aside, and the results
> show my sound code instead.

Note that there's no reason why you can't store the UTF-8 Devnagari
script version in the document data, but generate terms from the
anglicised version.
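
A minimal sketch of that separation, in plain Python so that it runs without
Xapian installed. Everything here is an assumption made for illustration:
`ToyIndex` is an in-memory stand-in (not the Xapian API), and `sound_code()`
is a hypothetical lookup table standing in for the sound-code step, whose
real algorithm isn't described in this thread. In Xapian itself the two
halves would correspond roughly to Document::set_data() and
Document::add_term().

```python
from collections import defaultdict

# Hypothetical stand-in for the sound-code step: a tiny lookup table,
# NOT the real algorithm (which is not described in this thread).
TOY_SOUND_CODES = {"नमस्ते": "nmst", "भारत": "bhrt", "दुनिया": "dny"}


def sound_code(word):
    """Return an ASCII 'sound code' for a word (toy mapping only)."""
    return TOY_SOUND_CODES.get(word, word.lower())


class ToyIndex:
    """In-memory stand-in for a Xapian database (not the Xapian API)."""

    def __init__(self):
        self.data = {}                    # docid -> stored document data
        self.postings = defaultdict(set)  # term -> set of docids

    def add_document(self, original_text):
        docid = len(self.data) + 1
        # Keep the original UTF-8 text as the document data...
        self.data[docid] = original_text
        # ...but generate the index terms from the anglicised form.
        for word in original_text.split():
            self.postings[sound_code(word)].add(docid)
        return docid

    def search(self, query_word):
        # Normalise the query the same way as the indexed text.
        docids = self.postings.get(sound_code(query_word), ())
        return [self.data[d] for d in sorted(docids)]


index = ToyIndex()
index.add_document("नमस्ते दुनिया")
index.add_document("नमस्ते भारत")
results = index.search("भारत")  # hits come back as the original text
```

The search matches on the sound code, but what comes back is the stored
Devnagari text, so the sound code never has to be shown to the user.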

It occurs to me that this "sounindics" is a term normalisation procedure,
so it's a lot like a stemming algorithm in many ways.  I can't find
any information on Google about it though - "sounindics" has no matches
and "sounindic" only finds a passing mention in someone's CV (a
reference to using Xapian in fact!)

But perhaps this algorithm should be wrapped as a Xapian::Stem class,
which would make it very easy to index and query Devnagari script
in this way.  Do you have a reference for it?

> So, is there anything by which I can maintain a reference to the original
> document?

An alternative approach is to store a unique id for the postgres
database record in the Xapian document data.
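
As a rough sketch of that idea (plain Python, with dictionaries standing in
for both the postgres table and the Xapian document data; the names
`RECORDS`, `document_data` and `fetch_original` are made up for this
example):

```python
# Hypothetical stand-in for a postgres table: primary key -> original row.
RECORDS = {
    101: {"title": "नमस्ते दुनिया"},
    102: {"title": "नमस्ते भारत"},
}

# Stand-in for the Xapian side: each indexed document stores only the
# record's primary key (as a string) as its document data.
document_data = {1: "101", 2: "102"}


def fetch_original(xapian_docid):
    """Map a search hit back to the original database record."""
    record_id = int(document_data[xapian_docid])
    # With a real database this would be a SELECT on the primary key.
    return RECORDS[record_id]


hit = fetch_original(2)
```

This keeps the Xapian database small, at the cost of one database lookup
per displayed result.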

> Is it possible to index only a specific column of the CSV?

Erm, of course.  Just generate terms only from that column!
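
For instance, a sketch using Python's standard csv module (the column names
and sample rows are invented for illustration, and splitting on whitespace
is only a placeholder for whatever term generation you actually use):

```python
import csv
import io

# Invented sample data; only the "title" column should be searchable.
CSV_TEXT = """id,title,notes
1,Xapian indexing,internal remark
2,Query parsing,do not index this
"""


def terms_for_column(csv_text, column):
    """Yield (row id, list of terms) using only the named column."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Other columns are simply never turned into terms.
        yield row["id"], row[column].lower().split()


indexed = dict(terms_for_column(CSV_TEXT, "title"))
```

The other columns can still go into the document data if you want to
display them; they just won't be searchable.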

Cheers,
    Olly


