[Xapian-tickets] [Xapian] #514: Omega language detection with textcat
Xapian
nobody at xapian.org
Thu Oct 31 08:02:25 GMT 2019
#514: Omega language detection with textcat
-------------------------+-------------------------------
 Reporter:  olly         |             Owner:  olly
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:  1.5.0
Component:  Omega        |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Changes (by olly):
* version: SVN trunk => git master
* milestone: 1.4.x => 1.5.0
Comment:
It seems the active libtextcat fork is libexttextcat
(https://wiki.documentfoundation.org/Libexttextcat), which is packaged
for Debian at least.
The patch needs updating to current git master and to use libexttextcat
(its API looks the same as libtextcat's, or not very different).
I think it would help if `Xapian::Stem`'s constructor could be told to
treat unknown language codes as `"none"` rather than throwing an
exception, since then we could just set a stemmer based on the detected
language.
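
Until the constructor can do that itself, the fallback could be approximated on the caller's side. A minimal sketch, assuming the detected language arrives as a plain string: `stem_language_or_none` is a hypothetical helper (not part of Xapian) which checks the name against a space-separated list and otherwise returns `"none"`.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not part of Xapian: return `detected` if it appears
// in the space-separated list `available`, otherwise "none", so the caller
// never passes an unknown language name on to the stemmer constructor.
std::string stem_language_or_none(const std::string& detected,
                                  const std::string& available)
{
    std::istringstream in(available);
    std::string name;
    while (in >> name) {
        if (name == detected) return detected;
    }
    // Also reached for an empty `detected`, since extraction never
    // yields an empty token.
    return "none";
}
```

The list would presumably come from `Xapian::Stem::get_available_languages()`, with the result passed to the `Xapian::Stem` constructor. As far as I know that list names each stemmer only once rather than every alias, so short codes reported by the detector would probably need mapping to those names first.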
We also still need a plan for handling multiple stemming languages.
If we add the stemmed terms as `Zfoo` for each language then we can search
unstemmed across the whole dataset, but a stemmed search needs to be
filtered by the respective `L`-prefix term. But this causes a stats
contamination problem between terms in different languages unless we
encode the language into the term prefix.
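
To illustrate what encoding the language into the term prefix might look like, here is a rough sketch. The `Z` + language + `:` layout is purely an assumption for illustration, not something Omega or the QueryParser currently generates or understands, and both function names are hypothetical.

```cpp
#include <string>

// Boolean filter term recording a document's language, using the `L`
// prefix mentioned above.
std::string language_filter_term(const std::string& lang)
{
    return "L" + lang;
}

// Hypothetical layout: "Z" + language code + ":" + stem. With a plain
// `Zfoo` scheme the English and French stems "chat" would be one term
// sharing one set of frequency statistics; here they are distinct terms.
std::string stemmed_term(const std::string& lang, const std::string& stem)
{
    return "Z" + lang + ":" + stem;
}
```

With this layout `stemmed_term("en", "chat")` and `stemmed_term("fr", "chat")` differ, so each gets its own statistics, at the cost of the query side needing to know which language's terms to generate.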
Alternatively, we could have a separate database for each language. This
seems more satisfactory, but care is needed to handle updated documents
whose detected language changes, and the consequences need working through.
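
The language-change case can be sketched with an in-memory stand-in for the per-language databases. This is only a model of the bookkeeping under that assumption, not Xapian code: `LanguageDbs` and `index_document` are hypothetical names, and a set of document IDs stands in for each database.

```cpp
#include <map>
#include <set>
#include <string>

// In-memory stand-in for one database per language:
// language code -> unique IDs of the documents stored there.
using LanguageDbs = std::map<std::string, std::set<std::string>>;

// Index (or re-index) `doc_id` under `new_lang`, first removing it from
// every other language's database so an update whose detected language
// changed does not leave a stale copy behind.
// Returns true if the document was moved from a different language.
bool index_document(LanguageDbs& dbs, const std::string& doc_id,
                    const std::string& new_lang)
{
    bool moved = false;
    for (auto& entry : dbs) {
        if (entry.first != new_lang && entry.second.erase(doc_id) > 0) {
            moved = true;
        }
    }
    dbs[new_lang].insert(doc_id);
    return moved;
}
```

Checking every database on each update is the simplest way to stay correct but may be too slow in practice; recording the previously detected language per document would avoid the scan, at the cost of keeping that record somewhere.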
I think this is too invasive for 1.4.x, so marking for 1.5.0.
--
Ticket URL: <https://trac.xapian.org/ticket/514#comment:4>
Xapian <https://xapian.org/>