[Xapian-discuss] Multiple synonym expanders

Jean-Francois Dockes jf at dockes.org
Fri Aug 24 15:28:41 BST 2012


Hi,

I wonder if the Xapian developers have any plans to support multiple
independently usable synonym expanders?

This could be useful for a number of uses which may need to be switched
on/off independently:

 - Expansion to similar meanings (current supposed usage)
 - Expansion to terms that have the same stem
 - Expansion to terms differing by character case only
 - Expansion to terms differing by diacritics only
 - Etc.

In particular, switchable case sensitivity is a recurring user request.
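
To make the case and diacritics items above concrete, here is a minimal
sketch in plain Python (no Xapian calls). Each expander is a table keyed
by a folded form of the term; the names fold_case, fold_diacritics,
build_expander and expand are hypothetical and only illustrate the idea,
they are not an existing API.

```python
import unicodedata


def fold_case(term):
    # Key for the "case only" expander: case-folded form of the term.
    return term.casefold()


def fold_diacritics(term):
    # Key for the "diacritics only" expander: decompose, then drop
    # combining marks. Case is left untouched on purpose.
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_expander(terms, fold):
    # Map each folded key to the set of index terms sharing that key.
    table = {}
    for term in terms:
        table.setdefault(fold(term), set()).add(term)
    return table


def expand(term, table, fold):
    # Return all known terms with the same folded key, or the term itself.
    return sorted(table.get(fold(term), {term}))


index_terms = ["Resume", "resume", "RESUME", "résumé"]
case_table = build_expander(index_terms, fold_case)
diac_table = build_expander(index_terms, fold_diacritics)
```

Because each table is built and queried separately, either expansion can
be switched on or off per query without touching the other.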

I was thinking of recycling the current mechanism I use for stem expansion
(a key-value store based on a separate Xapian index), but a real solution
integral to Xapian would be nicer.

While experimenting, I tried modifying the current mechanism with the
following approaches, and compared the creation times and resulting sizes:
 - stem as unique term, expansions in document record (reference)
 - stem as unique term, expansions as additional terms. 50% bigger,
   slightly slower.
 - gdbm: stem as key, expansions as data: bigger, slower.
 - stem as term, expansions as synonyms: almost 3 times faster and
   4 times smaller. Wow !

The performance and size of this index is not critical, but the synonyms
approach is faster, easier to use and yields a smaller db. Hhmm... hard
choice :)

If I can only ever expect one synonym db, I guess I could also hack
something based on prefixes inside the current synonyms mechanism? Does
this sound like a good approach?
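
As a rough illustration of the prefix idea, here is a sketch in plain
Python that models a single synonym table shared by several expanders.
The prefix strings and the PrefixedSynonymTable class are assumptions
made up for this example; with Xapian, the prefixed keys would
presumably be stored through WritableDatabase::add_synonym() and read
back with Database::synonyms_begin(), but that part is not shown here.

```python
from collections import defaultdict

# Hypothetical key prefixes, one per expander kind.
PREFIXES = {"stem": ":stem:", "case": ":case:", "diac": ":diac:"}


class PrefixedSynonymTable:
    """Single key -> expansions table; the key prefix selects the expander."""

    def __init__(self):
        self._table = defaultdict(set)

    def add(self, kind, key, expansion):
        # Store the expansion under the prefixed key for this expander.
        self._table[PREFIXES[kind] + key].add(expansion)

    def expand(self, key, kinds):
        # Union of the expansions from the expanders enabled for this query.
        result = set()
        for kind in kinds:
            result |= self._table.get(PREFIXES[kind] + key, set())
        return sorted(result)


table = PrefixedSynonymTable()
table.add("stem", "floor", "floors")
table.add("stem", "floor", "flooring")
table.add("case", "floor", "Floor")
```

Calling table.expand("floor", ["stem"]) only consults the stem entries,
while passing ["stem", "case"] merges both, so each expansion stays
switchable even though there is only one underlying table.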

Any idea that you can share on the subject would brighten my lantern and be
appreciated...

Cheers,

J.F. Dockes
