[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm

Tue Oct 9 02:17:57 BST 2012

On Thu, Sep 06, 2012 at 01:21:51PM -0400, Frank Chang wrote:
>    Good afternoon, I have been running some timing tests with 
> retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
> (const unsigned int *)MaryBuffer, len1, 2); where k =2 is the Levenshtein distance cutoff.
[...]
>  double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_){

What types are MaryBuffer and MaryCamsterBuffer?

edit_distance_unsigned() expects pointers to arrays of unsigned int
(holding Unicode code points) but I notice your edit distance function
takes char *, so I'm wondering if you're simply casting char* to
unsigned int* - that's not going to work as you intend, and would likely
result in computing edit distance for two much longer strings of junk
data, which would take longer than computing the edit distance of the
intended input.

> Our results indicate that after 50,000 string pair comparisons that
> edit_distance is equal in speed to a conventional Levenshtein distance
> dynamic programming implementation Big O(N ^^ 2)  shown below.
> I would expect the running time of the Xapian edit distance
> implementation based on Hal Berghel and David Roach's Big O(linear
> running time) algorithm would be substantially faster. Could someone
> please explain how I can make this speeed improvement happen if is is
> possible? Thank you.

If it isn't the issue above, could you post the whole test harness
you're using?  It's hard to reproduce if we have to recreate that code.

Cheers,
    Olly