[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm

Frank Chang frank_chang91 at hotmail.com
Thu Sep 6 18:21:51 BST 2012


   Good afternoon, I have been running some timing tests with 
retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
(const unsigned int *)MaryBuffer, len1, 2); where k =2 is the Levenshtein distance cutoff.
    Our results indicate that after 50,000 string pair comparisons that edit_distance is equal in speed to a conventional Levenshtein distance dynamic programming implementation Big O(N ^^ 2)  shown below.
    I would expect the running time of the Xapian edit distance implementation based on Hal Berghel and David Roach's Big O(linear running time) algorithm would be substantially faster. Could someone please explain how I can make this speeed improvement happen if is is possible? Thank you.
 
 double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_){
  int i,j;
 int Len1,Len2;
 int *Distances;
 int Cost,Distance;
 int maxLen;
 // The Levenshtein distance is calculated by creating a 2D array that looks
  //   like the following (say Str1=GUMBO, Str2=GAMBOL)
 // Step 1 + 2:
 //     G U M B O
 //   0 1 2 3 4 5
 // G 1 0 1 2 3 4
 // A 2 1 1 2 3 4
 // M 3 2 2 1 2 3
 // B 4 3 3 2 1 2
 // O 5 4 4 3 2 1
 // L 6 5 5 4 3 2 <- Step 3
 if (memcmp(Str1_,Str2_,Size_)==0)
  return 100.0;
 for (Len1=Size_-1;Len1>0 && Str1_[Len1]==' ';Len1--)
  ;
 for (Len2=Size_-1;Len2>0 && Str2_[Len2]==' ';Len2--)
  ;
 Len1++;
 Len2++;
 if (MIN(Len1,Len2)<=1)
  return 0.0;
 Distances=new int[(Len1+1)*(Len2+1)]; //Original
 maxLen = MAX(Len1, Len2);
   
 // Step 1: Populate 0th row and column of array:
 for (i=0;i<=Len1;i++)
  Distances[i]=i;
 for (i=0;i<=Len2;i++)
  Distances[i*(Len1+1)]=i;
 // Step 2: Populate inner elements:
 for (i=1;i<=Len1;i++) { 
  for (j=1;j<=Len2;j++) {
   Cost=(Str1_[i-1]==Str2_[j-1] ? 0 : 1);
   Distances[j*(Len1+1)+i]=MIN(Distances[(j-1)*(Len1+1)+i]+1,
    MIN(Distances[j*(Len1+1)+i-1]+1,Distances[(j-1)*(Len1+1)+i-1]+Cost));
  }
 }
 // Step 3: Distance is the bottom right element:
 Distance=Distances[(Len2+1)*(Len1+1)-1];
 delete[] Distances;
  // Step 3: Number of errors is bottom right-most element, we'll
 //   convert it into number of correct and make a percentage out of it:
  return (100.0 * (double)(maxLen - Distance) / (double)maxLen);
}
 		 	   		  


More information about the Xapian-discuss mailing list