[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm

Frank Chang frank_chang91 at hotmail.com
Wed Oct 10 10:48:17 BST 2012


Mr. Olly Betts, the types of MaryCamsterBuffer and MaryBuffer are char*. I changed the edit_distance_unsigned template specialization to
 
int
edit_distance_unsigned(const char * ptr1, int len1,
                       const char * ptr2, int len2,
                       int max_distance)
{
    return seqcmp_editdist<char>(ptr1, len1, ptr2, len2, max_distance);
}
and I still see no speed improvement.
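
An alternative to changing the specialization, following your note about the casts, would be to keep the unsigned int interface and widen the byte buffers before the call instead of casting the pointers. A minimal sketch, assuming single-byte characters (widen_bytes is just an illustrative helper; real multi-byte UTF-8 input would need proper code point decoding):

#include <vector>

// Widen a byte string to the array of unsigned int code points that
// edit_distance_unsigned() expects, rather than casting char* directly.
static std::vector<unsigned int> widen_bytes(const char *p, int len)
{
    std::vector<unsigned int> out(len);
    for (int i = 0; i < len; ++i)
        out[i] = static_cast<unsigned char>(p[i]); // one code point per byte
    return out;
}

// Usage:
//   std::vector<unsigned int> a = widen_bytes(MaryCamsterBuffer, len2);
//   std::vector<unsigned int> b = widen_bytes(MaryBuffer, len1);
//   retval = edit_distance_unsigned(&a[0], len2, &b[0], len1, 2);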
      
It will take a few days to prepare a test harness for you because it contains some proprietary code.
 
 
However, Mr. Olly Betts, we found a new Levenshtein distance optimization, described in http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/ (written by Thomas Keller). If you have the time, could you critique that algorithm compared to the one Xapian uses:
/*
 * Based on that described in:
 *
 * "An extension of Ukkonen's enhanced dynamic programming ASM algorithm"
 * by Hal Berghel, University of Arkansas
 * and David Roach, Acxiom Corporation
 *
 * http://berghel.net/publications/asm/asm.php
 */
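
For context, the heart of Ukkonen's enhancement is that a cell further than max_distance from the main diagonal can never lie on a path of cost <= max_distance, so only a narrow band of the matrix needs to be filled in; that is what makes the running time roughly O(k*n) rather than O(n^2). Here is a minimal sketch of the banding idea (my own illustration, not Xapian's actual code; banded_editdist is an illustrative name):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Returns the edit distance if it is <= k, otherwise k + 1.
static int banded_editdist(const char *s1, int len1,
                           const char *s2, int len2, int k)
{
    if (std::abs(len1 - len2) > k) return k + 1;
    const int BIG = k + 1;
    std::vector<int> prev(len2 + 1), cur(len2 + 1);
    for (int j = 0; j <= len2; ++j) prev[j] = (j <= k) ? j : BIG;
    for (int i = 1; i <= len1; ++i) {
        // Only columns within k of the diagonal can matter in this row.
        int lo = std::max(1, i - k), hi = std::min(len2, i + k);
        cur[lo - 1] = (lo == 1) ? i : BIG;
        int best = cur[lo - 1];
        for (int j = lo; j <= hi; ++j) {
            int cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            // prev[j] is only valid if column j was inside the previous band.
            int up = (j <= i - 1 + k) ? prev[j] + 1 : BIG;
            cur[j] = std::min(std::min(cur[j - 1] + 1, up),
                              prev[j - 1] + cost);
            best = std::min(best, cur[j]);
        }
        if (best > k) return k + 1; // the whole band exceeded the cutoff
        prev.swap(cur);
    }
    return std::min(prev[len2], BIG);
}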
 
 
 
The article is titled "Fuzzy-String Search: Find misspelled information with T-SQL". We implemented its approach in C++, shown below:
 
double cFuzzyCompareUTF8Levenshtein::Compare(void) {
    int maxLen = MAX(Len1, Len2);
    // Maximum number of edits we can tolerate before giving up:
    int abortThreshold = maxLen - (int)floor(MaxEditThreshold * maxLen);
    // No sense in continuing for short compares or when the lengths are
    // pretty far apart:
    if (Len1 <= 1 || Len2 <= 1 ||
        Len1 - Len2 > abortThreshold || Len2 - Len1 > abortThreshold)
        return 0;
    // Note: char-sized cells cap costs at 127, so this assumes
    // maxLen <= 127; the shorter string is assumed to be padded with
    // filler characters up to maxLen (the '?' filler in the examples below).
    char *minColCost = new char[maxLen + 1];
    memset(minColCost, 0x7f, (maxLen + 1) * sizeof(char));
    char *distances = new char[(maxLen + 1) * (maxLen + 1)];
    // Populate the 0th row and column of the array:
    for (int i = 0; i <= maxLen; i++) {
        distances[i] = i;
        distances[i * (maxLen + 1)] = i;
    }
    // Populate inner elements:
    for (int i = 1; i <= maxLen; i++) {
        char minRowCost = 0x7f;
        for (int j = 1; j <= maxLen; j++) {
            // Calculate penalty that should be incurred:
            char cost = (Str1[i - 1] == Str2[j - 1] ? 0 : 1);
            // Distance at i,j is the minimum of the cell immediately above
            // plus 1, the cell immediately left plus 1, or the cell up-left
            // plus cost:
            distances[j * (maxLen + 1) + i] =
                MIN(distances[(j - 1) * (maxLen + 1) + i] + 1,
                    MIN(distances[j * (maxLen + 1) + i - 1] + 1,
                        distances[(j - 1) * (maxLen + 1) + i - 1] + cost));
            // Keep a running tally of minimum costs for this row and column:
            minRowCost = MIN(minRowCost, distances[j * (maxLen + 1) + i]);
            minColCost[j] = MIN(minColCost[j], distances[j * (maxLen + 1) + i]);
            // If the minimum costs of the row and column both exceed the
            // threshold, abort; we will never get close with these strings:
            if (i == j && minRowCost > abortThreshold
                       && minColCost[j] > abortThreshold) {
                delete[] distances;
                delete[] minColCost;
                return 0;
            }
        }
    }
    /* DumpDistances(distances, maxLen); // For debugging... */
    double retVal = 100.0 *
        (double)(maxLen - distances[(maxLen + 1) * (maxLen + 1) - 1]) /
        (double)maxLen;
    delete[] distances;
    delete[] minColCost;
    return retVal;
}
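
As a quick sanity check of the scoring formula at the end (100 * (maxLen - distance) / maxLen), here is a small self-contained sketch using a plain single-row Levenshtein; the levenshtein helper is illustrative and not part of the class above:

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>

static int levenshtein(const char *a, const char *b)
{
    int n = std::strlen(a), m = std::strlen(b);
    std::vector<int> row(m + 1);
    for (int j = 0; j <= m; ++j) row[j] = j;
    for (int i = 1; i <= n; ++i) {
        int diag = row[0]; // value of row[j-1] from the previous row
        row[0] = i;
        for (int j = 1; j <= m; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int next = std::min(std::min(row[j] + 1, row[j - 1] + 1),
                                diag + cost);
            diag = row[j];
            row[j] = next;
        }
    }
    return row[m];
}

int main()
{
    int d = levenshtein("Ancorage", "Anchorage"); // one dropped letter -> 1
    int maxLen = 9;
    std::printf("distance=%d score=%.1f\n", d, 100.0 * (maxLen - d) / maxLen);
    // Prints: distance=1 score=88.9
    return 0;
}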
 
Below is some explanation of the shortcut logic, excerpted from Keller's article (the distance-matrix figures in the article are not reproduced here):

Here's another example: Ancorage versus Anchorage (1 drop). I used '?' as a filler:

[distance matrix figure omitted]
So here we see the problem you mentioned: the diagonal does not contain the lowest number, and you get a premature exit. In this case, if K=1, we'd prematurely stop at row=col=5. However, there is a lower cost at row=5, col=4 that might survive until the end. So the question is: how can we tailor the algorithm to see that value instead of the one on the diagonal?
 
I think the answer is that you need to keep track of the minimum cost of each row and column, and at the row == col location, evaluate that minimum cost against your K. For example (skipping the first 3 iterations and jumping to row=3, col=3):

[distance matrix figure omitted]
Say we've kept track of the minimal cost for this row, and also the minimal cost for this column. Both of these values are 0, so we're still on track. Moving to row=4, col=4:

[distance matrix figure omitted]
Now, in this case our minimal cost has jumped to 1. (In the article's figure it is highlighted in orange, but it could have been any one of the three cells containing a '1' in row=4 or col=4.) What's important is that it's now 1 and not 0. Since I've said K=2, we can keep processing. Moving to row=5, col=5:

[distance matrix figure omitted]
Same deal: our minimum cost is still 1. Our 'path' has turned a bit from the previous position, but that doesn't matter; we're not actually drawing the path (as in the article's earlier green illustration), so straying off the 'real' path from time to time is fine.
 
You can probably see now that we would iterate all the way to the bottom-right endpoint, which is exactly what we want for this example: no shortcutting.
 
Here’s another example, Anokage versus Anchorage. In this case, we want to see a shortcut happen, as these guys are at least 3 off:
 

 
And here we see, at row=6, col=6, that our minimal cost is 3, which is more than our K (2 in this example), so we can stop.
 
So I think you just need to adjust the algorithm to keep track of the current row's and each column's minimal values. When both of those minimums exceed K at the row == col position, we can jump out.
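
To make that adjustment concrete, here is a compact sketch of the same early exit using an absolute cutoff K rather than the percentage threshold in the class above (within_k is an illustrative name; out-of-range positions act like the '?' filler, i.e. always a mismatch):

#include <algorithm>
#include <vector>

// Returns true if the (padded) edit distance is within K; bails out early
// at the diagonal once both the row and column minima exceed K.
static bool within_k(const char *a, int n, const char *b, int m, int K)
{
    int size = (n > m) ? n : m;
    std::vector<int> d((size + 1) * (size + 1));
    std::vector<int> minColCost(size + 1, size + 1);
    for (int i = 0; i <= size; i++) {
        d[i] = i;
        d[i * (size + 1)] = i;
    }
    for (int i = 1; i <= size; i++) {
        int minRowCost = size + 1;
        for (int j = 1; j <= size; j++) {
            // Out-of-range positions behave like the '?' filler: mismatch.
            int cost = (i <= n && j <= m && a[i - 1] == b[j - 1]) ? 0 : 1;
            int v = std::min(std::min(d[(j - 1) * (size + 1) + i] + 1,
                                      d[j * (size + 1) + i - 1] + 1),
                             d[(j - 1) * (size + 1) + i - 1] + cost);
            d[j * (size + 1) + i] = v;
            minRowCost = std::min(minRowCost, v);
            minColCost[j] = std::min(minColCost[j], v);
            // Every path must cross row i (cols 1..i) or column i
            // (rows 1..i); if all those cells exceed K, we can stop.
            if (i == j && minRowCost > K && minColCost[j] > K)
                return false;
        }
    }
    return d[(size + 1) * (size + 1) - 1] <= K;
}

For example, within_k("Anokage", 7, "Anchorage", 9, 2) should take the shortcut and return false, while within_k("Ancorage", 8, "Anchorage", 9, 2) should run to the end and return true.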
 

 
> Date: Tue, 9 Oct 2012 02:17:57 +0100
> From: olly at survex.com
> To: frank_chang91 at hotmail.com
> CC: xapian-discuss at lists.xapian.org
> Subject: Re: [Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm
> 
> On Thu, Sep 06, 2012 at 01:21:51PM -0400, Frank Chang wrote:
> > Good afternoon, I have been running some timing tests with 
> > retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
> > (const unsigned int *)MaryBuffer, len1, 2); where k =2 is the Levenshtein distance cutoff.
> [...]
> > double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_){
> 
> What types are MaryBuffer and MaryCamsterBuffer?
> 
> edit_distance_unsigned() expects pointers to arrays of unsigned int
> (holding Unicode code points) but I notice your edit distance function
> takes char *, so I'm wondering if you're simply casting char* to
> unsigned int* - that's not going to work as you intend, and would likely
> result in computing edit distance for two much longer strings of junk
> data, which would take longer than computing the edit distance of the
> intended input.
> 
> > Our results indicate that after 50,000 string pair comparisons,
> > edit_distance is equal in speed to a conventional Levenshtein distance
> > dynamic programming implementation, O(N^2), shown below.
> > I would expect the running time of the Xapian edit distance
> > implementation, based on Hal Berghel and David Roach's linear-time
> > algorithm, to be substantially faster. Could someone
> > please explain how I can make this speed improvement happen, if it is
> > possible? Thank you.
> 
> If it isn't the issue above, could you post the whole test harness
> you're using? It's hard to reproduce if we have to recreate that code.
> 
> Cheers,
> Olly

 
 
Thank you, Mr. Olly Betts.

