[Xapian-tickets] [Xapian] #514: Omega language detection with textcat

Xapian nobody at xapian.org
Thu Oct 31 08:02:25 GMT 2019


#514: Omega language detection with textcat
-------------------------+-------------------------------
 Reporter:  olly         |             Owner:  olly
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:  1.5.0
Component:  Omega        |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Changes (by olly):

 * version:  SVN trunk => git master
 * milestone:  1.4.x => 1.5.0


Comment:

 The actively maintained fork of libtextcat seems to be libexttextcat
 (https://wiki.documentfoundation.org/Libexttextcat), which is packaged
 for Debian at least.

 The patch needs updating to current git master and porting to
 libexttextcat (it looks like the API is the same, or not very different).

 I think it would help if `Xapian::Stem`'s constructor could be told to
 treat unknown language codes as `"none"` rather than throwing an
 exception, since then we could just set a stemmer based on the detected
 language.
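
 As a rough illustration of the fallback behaviour described above, here is
 a caller-side sketch. `stemmer_name_or_none` is a hypothetical helper, not
 part of Xapian or Omega, and it assumes the detected language has already
 been mapped to the same form of name as appears in the space-separated
 list of available stemmers.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a Xapian API): return `detected` if it appears in
// the space-separated list `available`, otherwise the stemmer name "none".
std::string stemmer_name_or_none(const std::string& detected,
                                 const std::string& available) {
    std::istringstream names(available);
    std::string name;
    while (names >> name) {
        if (name == detected) return detected;
    }
    // Unknown (or empty) language: fall back to no stemming.
    return "none";
}
```

 In real use the list would presumably come from
 `Xapian::Stem::get_available_languages()` and the result would be passed
 to the `Xapian::Stem` constructor. That list may not include every alias
 the constructor accepts, so this check can be stricter than the
 constructor itself.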

 We also still need a plan for handling multiple stemming languages.

 If we add the stemmed terms as `Zfoo` for each language then we can search
 unstemmed across the whole dataset, but a stemmed search needs to be
 filtered by the respective `L`-prefix term.  This also causes a stats
 contamination problem, since the same `Z` term can come from documents in
 different languages, unless we encode the language into the term prefix.
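
 To make the prefix idea concrete, here is a small sketch. The
 `Z<lang>:<stem>` layout and both function names are assumptions made up
 for illustration; this is not an existing Omega or QueryParser convention.

```cpp
#include <string>

// Shared prefix: the same stem from any language maps to the same term,
// so term statistics are pooled across languages.
std::string shared_stem_term(const std::string& stem) {
    return "Z" + stem;
}

// Hypothetical language-qualified prefix: the language code is folded into
// the prefix, so each language gets its own term and its own statistics.
std::string per_language_stem_term(const std::string& lang,
                                   const std::string& stem) {
    return "Z" + lang + ":" + stem;
}
```

 With this layout the stem `chat` would be indexed as `Zen:chat` for an
 English document and `Zfr:chat` for a French one, whereas the shared form
 gives `Zchat` for both.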

 Alternatively we could have a separate database for each language.  This
 seems more satisfactory, but care is needed to handle updated documents
 for which the detected language changes, and the consequences need
 working through.
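
 The update case can be sketched with in-memory maps standing in for the
 per-language databases. `PerLanguageIndex` and its members are
 hypothetical and use no Xapian API; the point is only that re-indexing a
 document under a new language has to remove it from the old database.

```cpp
#include <map>
#include <string>

// Stand-in for a set of per-language databases: language -> (doc id -> text).
class PerLanguageIndex {
    std::map<std::string, std::map<std::string, std::string>> dbs_;
    // Which language each document is currently filed under.
    std::map<std::string, std::string> lang_of_;

  public:
    // Add or update a document under its (re)detected language.
    void index(const std::string& id, const std::string& lang,
               const std::string& text) {
        auto it = lang_of_.find(id);
        if (it != lang_of_.end() && it->second != lang) {
            // The detected language changed: drop the stale copy first.
            dbs_[it->second].erase(id);
        }
        dbs_[lang][id] = text;
        lang_of_[id] = lang;
    }

    // True if the database for `lang` currently holds document `id`.
    bool contains(const std::string& lang, const std::string& id) const {
        auto db = dbs_.find(lang);
        return db != dbs_.end() && db->second.count(id) != 0;
    }
};
```

 This relies on a record of each document's previous language; without
 one, an update would have to check every language's database for a stale
 copy.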

 I think this is too invasive for 1.4.x, so marking for 1.5.0.

--
Ticket URL: <https://trac.xapian.org/ticket/514#comment:4>
Xapian <https://xapian.org/>