[Xapian-discuss] Multiple synonym expanders
Jean-Francois Dockes
jf at dockes.org
Fri Aug 24 15:28:41 BST 2012
Hi,
I wonder if the Xapian developers have any plans to support multiple
independently usable synonym expanders?
This could be useful for a number of purposes which may need to be switched
on/off independently:
- Expansion to similar meanings (current supposed usage)
- Expansion to terms that have the same stem
- Expansion to terms differing by character case only
- Expansion to terms differing by diacritics only
- Etc.
In particular, switchable case sensitivity is a recurring user request.
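
To make the idea concrete, here is a minimal sketch of two such expanders
(case and diacritics) built as independent maps from a folded key to the
index terms sharing that key. It is plain Python and does not use Xapian;
the helper names (fold_case, strip_diacritics, build_expander, expand) and
the sample terms are made up for illustration only.

```python
import unicodedata
from collections import defaultdict


def fold_case(term):
    """Key for the 'differs by character case only' expander."""
    return term.casefold()


def strip_diacritics(term):
    """Key for the 'differs by diacritics only' expander (case is kept)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_expander(terms, key_func):
    """Group index terms by their folded key: key -> set of terms."""
    groups = defaultdict(set)
    for term in terms:
        groups[key_func(term)].add(term)
    return groups


def expand(term, expander, key_func):
    """Return all index terms equivalent to `term` under one expander."""
    return sorted(expander.get(key_func(term), {term}))


# Hypothetical index terms, for illustration only.
index_terms = ["Resume", "resume", "résumé", "RESUME"]
case_expander = build_expander(index_terms, fold_case)
diac_expander = build_expander(index_terms, strip_diacritics)

case_matches = expand("resume", case_expander, fold_case)
diac_matches = expand("resume", diac_expander, strip_diacritics)
```

Each expander is a separate map, so a query can apply either, both, or
neither, which is the on/off switching described above.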
I was thinking of recycling the current mechanism I use for stem expansion
(a key-value store based on a separate Xapian index), but a real solution
integral to Xapian would be nicer.
While experimenting, I modified the current mechanism using the
following approaches and compared the creation times and resulting sizes:
- stem as unique term, expansions in document record (reference)
- stem as unique term, expansions as additional terms. 50% bigger,
slightly slower.
- gdbm: stem as key, expansions as data: bigger, slower.
- stem as term, expansions as synonyms: almost 3 times faster and
4 times smaller. Wow!
The performance and size of this index are not critical, but the synonyms
approach is faster, easier to use and yields a smaller db. Hmm... hard
choice :)
If I can only ever expect one synonym db, I guess I could also hack
something based on prefixes inside the current synonyms mechanism? Does
this sound like a good approach?
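
A rough sketch of what I mean by the prefix hack, with a plain Python dict
standing in for the single synonym table (in Xapian the entries would be
stored with WritableDatabase::add_synonym()). The prefix strings, the
PrefixedSynonyms class and the naive_stem helper are all hypothetical and
only there to show the key layout; naive_stem is a toy, not Xapian's
stemmer.

```python
from collections import defaultdict

# Hypothetical key prefixes, one per expander kind, sharing one table.
PREFIXES = {"stem": ":stem:", "case": ":case:"}


def naive_stem(term):
    """Toy stemmer for illustration: drops a single trailing 's'."""
    return term[:-1] if term.endswith("s") else term


# How each expander kind derives its lookup key from a term.
KEY_FUNCS = {"stem": naive_stem, "case": str.casefold}


class PrefixedSynonyms:
    """One synonym table; the key prefix selects the expander kind."""

    def __init__(self):
        self._table = defaultdict(set)

    def add_term(self, term):
        """Register an index term under every expander kind."""
        for kind, key_func in KEY_FUNCS.items():
            self._table[PREFIXES[kind] + key_func(term)].add(term)

    def expand(self, term, enabled):
        """Expand `term` using only the expander kinds in `enabled`."""
        result = {term}
        for kind in enabled:
            key = PREFIXES[kind] + KEY_FUNCS[kind](term)
            result |= self._table.get(key, set())
        return sorted(result)


synonyms = PrefixedSynonyms()
for indexed_term in ["Floor", "floor", "floors"]:
    synonyms.add_term(indexed_term)

only_case = synonyms.expand("floor", ["case"])
only_stem = synonyms.expand("floor", ["stem"])
both = synonyms.expand("floor", ["case", "stem"])
neither = synonyms.expand("floor", [])
```

The obvious downside is that the prefixed keys share the namespace with
ordinary synonyms, so the prefixes must never collide with a real term.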
Any ideas you can share on the subject would enlighten me and be much
appreciated...
Cheers,
J.F. Dockes