Serbian language

Olly Betts olly at survex.com
Thu Nov 17 03:37:15 GMT 2016


On Wed, Nov 16, 2016 at 07:03:31PM +0100, Aleksandar Pavic wrote:
> I am interested in adding Serbian as a language for stemming.

To incorporate a stemmer in Xapian, it needs to be:

  * suitably licensed - MIT/X or BSD (2 or 3 clause) or similar, so that
    incorporating it doesn't block relicensing
  * written in either Snowball (https://snowballstem.org/) or C/C++
  * accompanied by a vocabulary list with matching stems, and that also needs
    to be suitably licensed (though we can probably allow GPL-ed wordlists).
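
As a rough sketch of how such a vocabulary list might be used (this is not
Xapian's or Snowball's actual test harness), the Python below runs a stemmer
over a word list and reports every word whose stem differs from the expected
one.  The names check_stemmer and toy_stem, and the three sample words with
their stems, are made up for illustration; toy_stem is a placeholder and not
a real Serbian stemming algorithm.

```python
def check_stemmer(stem, words, expected_stems):
    """Return (word, expected, actual) for every word the stemmer gets wrong."""
    if len(words) != len(expected_stems):
        raise ValueError("word list and stem list must have the same length")
    mismatches = []
    for word, expected in zip(words, expected_stems):
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


def toy_stem(word):
    # Hypothetical placeholder: strips a single trailing "a" or "e".
    # This is NOT a real Serbian stemming algorithm.
    if len(word) > 3 and word[-1] in "ae":
        return word[:-1]
    return word


# Made-up sample data, for illustration only.
words = ["knjiga", "knjige", "grad"]
expected_stems = ["knjig", "knjig", "grad"]

# An empty list means every word produced the expected stem.
print(check_stemmer(toy_stem, words, expected_stems))
```

In practice the two lists would be read from the vocabulary and stem files
supplied with the stemmer, one entry per line.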

It's good if we can be confident that the algorithm works well, as changing
it later results in incompatible searching of existing databases.  Ideally
it would be based on, for example, a peer-reviewed paper which includes
details of evaluation.
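
To illustrate the incompatibility, here is a minimal sketch which uses a
plain Python dict as a stand-in for a database (it does not use Xapian's
API).  The two stemmers old_stem and new_stem are hypothetical toy rules for
English words, chosen only to show how a term indexed under one version of
an algorithm can fail to match a query stemmed with a later version.

```python
def old_stem(word):
    # Hypothetical "version 1" rule: strip a trailing "s".
    return word[:-1] if word.endswith("s") else word


def new_stem(word):
    # Hypothetical "version 2" rule: strip a trailing "es" first,
    # otherwise fall back to the version 1 rule.
    if word.endswith("es"):
        return word[:-2]
    return old_stem(word)


# Build a toy index with the OLD stemmer: stemmed term -> set of document ids.
index = {}
for doc_id, text in enumerate(["boxes and foxes", "a box"]):
    for word in text.split():
        index.setdefault(old_stem(word), set()).add(doc_id)


def search(query, stem):
    # Stem the query word and look it up in the existing index.
    return index.get(stem(query), set())


# The old stemmer finds document 0, which contains "boxes".
print(search("boxes", old_stem))
# The new stemmer produces a different term, so document 0 is missed.
print(search("boxes", new_stem))
```

The existing index would need to be rebuilt with the new stemmer for the
second query to find document 0.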

It would also be good to have references to any papers, etc. that the
stemmer aims to implement, along with any intentional points of deviation
and the reasons for them (otherwise someone will later report that the
stemmer doesn't follow the paper and it'll be hard to know if that's
intended).

> I'm interested in what kind of development/database is required to do that,
> and I can maybe include some people from university, etc...

There are some existing stemmers for Serbian, e.g. this one in Python:

https://github.com/nikolamilosevic86/SerbianStemmer

I don't see an explicit licence stated there, but you could ask the author
to actually specify one if it seems a suitable starting point.

Cheers,
    Olly



More information about the Xapian-devel mailing list