我正在进行模糊搜索实现,作为实现的一部分,我们正在使用Apache的StringUtils.getLevenshteinDistance。目前,我们正在寻找模糊搜索的特定最大平均响应时间。经过各种改进和一些剖析后,花费最多时间的地方是计算Levenshtein距离。它占搜索字符串总时间的大约80-90%三个字母或更多。
现在,我知道在这里可以做些什么有一些限制,但我已经读过以前的SO问题和LD的维基百科链接,如果有人愿意将阈值限制在设定的最大距离,那可能帮助限制在算法上花费的时间,但我不确定如何准确地做到这一点。
如果我们只对此感兴趣 距离小于a 阈值k,那么就足够了 计算宽度的对角条纹 矩阵中2k + 1。就这样, 算法可以在O(kl)时间内运行, 其中l是最短的长度 字符串。[3]
下面您将看到StringUtils的原始LH代码。之后是我的修改。我试图基本上计算设定长度与i,j对角线的距离(因此,在我的例子中,i,j对角线上方和下方的两个对角线)。但是,这是不正确的,因为我已经这样做了。例如,在最高的对角线上,它总是会直接在上面选择单元格值,这将是0.如果有人能告诉我如何使这个功能如我所描述的那样,或者如何使它成为如此的一般建议, 这将不胜感激。
public static int getLevenshteinDistance(String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
String tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i<=n; i++) {
p[i] = i;
}
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
我的修改(仅限于for循环):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
答案 0 :(得分:5)
我写过关于Levenshtein自动机的文章,这是在O(n)之前进行此类检查的一种方法,here。源代码示例使用Python,但解释应该是有用的,参考文章提供了更多详细信息。
答案 1 :(得分:3)
实现窗口的问题是处理第一个条目左侧和每行最后一个条目的值。
一种方法是启动您最初填写的值为1而不是0,然后忽略您遇到的任何0。你必须从你的最终答案中减去1。
另一种方法是用高值填充第一个和最后一个左边的条目,这样最小的检查永远不会选择它们。这是我前几天必须实施的方式:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}
答案 2 :(得分:1)
Here有人回答了一个非常相似的问题:
引用:
我做过很多次了。我这样做的方法是使用可能更改的游戏树的递归深度优先树。有一个预算k的变化,我用来修剪树。有了这个例程,首先我用k = 0运行它,然后k = 1,然后k = 2,直到我得到命中或者我不想再高一点。
的
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
/* if the budget is exhausted, prune the search */
if (k < 0) return false;
/* if at end of both strings we have a match */
if (ia == na && ib == nb) return true;
/* if the first characters match, continue walking with no reduction in budget */
if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
/* if the first characters don't match, assume there is a 1-character replacement */
if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
/* try assuming there is an extra character in a */
if (ia < na && walk(ia+1, ib, k-1)) return true;
/* try assuming there is an extra character in b */
if (ib < nb && walk(ia, ib+1, k-1)) return true;
/* if none of those worked, I give up */
return false;
}
只是主要部分,更多原始代码
答案 3 :(得分:1)
我使用原始代码并将其放在j for循环结束之前:
if (p[n] > s.length() + 5)
break;
+5是任意的,但就我们的目的而言,如果距离是查询长度加上五(或我们确定的任何数字),那么返回的内容并不重要,因为我们认为匹配太过不同了。它确实减少了一些事情。尽管如此,如果有人能够更好地理解这一点,那么确定这不是维基声明所说的想法。
答案 4 :(得分:1)
根据“Gusfield,Dan(1997)。关于字符串,树和序列的算法:计算机科学和计算生物学”(第264页),你应该忽略零。
答案 5 :(得分:0)
Apache Commons Lang 3.4有这个实现:
/**
* <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
* threshold.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
* and Chas Emerick's implementation of the Levenshtein distance algorithm from
* <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, *, -1) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","", 0) = 0
* StringUtils.getLevenshteinDistance("aaapppp", "", 8) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 7) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 6)) = -1
* StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
* StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
* StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @param threshold the target threshold, must not be negative
* @return result distance, or {@code -1} if the distance would be greater than the threshold
* @throws IllegalArgumentException if either String input {@code null} or negative threshold
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
if (threshold < 0) {
throw new IllegalArgumentException("Threshold must not be negative");
}
/*
This implementation only computes the distance if it's less than or equal to the
threshold value, returning -1 if it's greater. The advantage is performance: unbounded
distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
computing a diagonal stripe of width 2k + 1 of the cost table.
It is also possible to use this to compute the unbounded Levenshtein distance by starting
the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe
eg.
p[] = |#|#|#|*
d[] = *|#|#|#|
We must ignore the entry to the left of the leftmost member
We must ignore the entry above the rightmost member
Another subtlety comes from our stripe running off the matrix if the strings aren't
of the same size. Since string s is always swapped to be the shorter of the two,
the stripe will always run off to the upper right instead of the lower left of the matrix.
As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5
1 |#|#| | | |
2 |#|#|#| | |
3 | |#|#|#| |
4 | | |#|#|#|
5 | | | |#|#|
6 | | | | |#|
7 | | | | | |
Note how the stripe leads off the table as there is no possible way to turn a string of length 5
into one of length 7 in edit distance of 1.
Additionally, this implementation decreases memory usage by using two
single-dimensional arrays and swapping them back and forth instead of allocating
an entire n by m matrix. This requires a few minor changes, such as immediately returning
when it's detected that the stripe has run off the matrix and initially filling the arrays with
large values so that entries we don't compute are ignored.
See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
// if one string is empty, the edit distance is necessarily the length of the other
if (n == 0) {
return m <= threshold ? m : -1;
} else if (m == 0) {
return n <= threshold ? n : -1;
}
if (n > m) {
// swap the two strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; // placeholder to assist in swapping p and d
// fill in starting table values
final int boundary = Math.min(n, threshold) + 1;
for (int i = 0; i < boundary; i++) {
p[i] = i;
}
// these fills ensure that the value above the rightmost entry of our
// stripe will be ignored in following loop iterations
Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// iterates through t
for (int j = 1; j <= m; j++) {
final char t_j = t.charAt(j - 1); // jth character of t
d[0] = j;
// compute stripe indices, constrain to array size
final int min = Math.max(1, j - threshold);
final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
// the stripe may lead off of the table if s and t are of different sizes
if (min > max) {
return -1;
}
// ignore entry left of leftmost
if (min > 1) {
d[min - 1] = Integer.MAX_VALUE;
}
// iterates through [min, max] in s
for (int i = min; i <= max; i++) {
if (s.charAt(i - 1) == t_j) {
// diagonally left and up
d[i] = p[i - 1];
} else {
// 1 + minimum of cell to the left, to the top, diagonally left and up
d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
}
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// if p[n] is greater than the threshold, there's no guarantee on it being the correct
// distance
if (p[n] <= threshold) {
return p[n];
}
return -1;
}