我一直在阅读并测试php levenshtein 中的一些示例。 将$ input与$ words进行比较 输出 比较
$input = 'hw r u my dear angel';
// array of words to check against
$words = array('apple','pineapple','banana','orange','how are you',
'radish','carrot','pea','bean','potato','hw are you');
输出
Input word: hw r u my dear angel
Did you mean: hw are you?
比较,删除数组中的 hw 。
$input = 'hw r u my dear angel';
// array of words to check against
$words = array('apple','pineapple','banana','orange','how are you',
'radish','carrot','pea','bean','potato');
在数组输出
中第二次删除 hw是Input word: hw r u my dear angel
Did you mean: orange?
similar_text()
echo '<br/>how are you:'.similar_text($input,'how are you');
echo '<br/>orange:'.similar_text($input,'orange');
echo '<br/>hw are you:'.similar_text($input,'hw are you');
how are you:6
orange:5
hw are you:6
第二次比较为什么输出橙色 你好还有6个类似的文字,如 hw你是?有没有办法改进或更好的方法呢?同时我保存数据库上所有可能的输入。我应该查询并存储在array
中,然后使用foreach
获取levenshtein distance
?但如果有数百万人,这将是缓慢的。
CODE
<?php
// input misspelled word
$input = 'hw r u my dear angel';
// array of words to check against
$words = array('apple','pineapple','banana','orange','how are you',
'radish','carrot','pea','bean','potato','hw are you');
// no shortest distance found, yet
$shortest = -1;
$closest = closest($input,$words,$shortest);
echo "Input word: $input<br/>";
if ($shortest == 0) {
echo "Exact match found: $closest\n";
} else {
echo "Did you mean: $closest?\n";
}
echo '<br/><br/>';
$shortest = -1;
$words = array('apple','pineapple','banana','orange','how are you',
'radish','carrot','pea','bean','potato');
$closest = closest($input,$words,$shortest);
echo "Input word: $input<br/>";
if ($shortest == 0) {
echo "Exact match found: $closest\n";
} else {
echo "Did you mean: $closest?\n";
}
echo '<br/><br/>';
echo 'Similar text';
echo '<br/>how are you:'.similar_text($input,'how are you');
echo '<br/>orange:'.similar_text($input,'orange');
echo '<br/>hw are you:'.similar_text($input,'hw are you');
function closest($input,$words,&$shortest){
// loop through words to find the closest
foreach ($words as $word) {
// calculate the distance between the input word,
// and the current word
$lev = levenshtein($input, $word);
// check for an exact match
if ($lev == 0) {
// closest word is this one (exact match)
$closest = $word;
$shortest = 0;
// break out of the loop; we've found an exact match
break;
}
// if this distance is less than the next found shortest
// distance, OR if a next shortest word has not yet been found
if ($lev <= $shortest || $shortest < 0) {
// set the closest match, and shortest distance
$closest = $word;
$shortest = $lev;
}
}
return $closest;
}
?>
答案 0 :(得分:5)
首先,输出similar_text()
无关紧要,因为它使用另一种算法来计算字符串之间的相似性。
让我们试着理解为什么levenstein()
认为, hw ru亲爱的ange 更接近橙色而不是'你好吗。维基百科有Levenstein距离的good definition。
非正式地,两个单词之间的Levenshtein距离是将一个单词更改为另一个单词所需的单字符编辑(插入,删除,替换)的最小数量。
现在让我们计算一下我们需要做多少次修改才能将我亲爱的天使改为橙色。
因此需要1 + 1 + 12 + 1 = 15
次编辑才能将我亲爱的天使改为橙色。
这是 hw,我亲爱的天使转变为你好吗。
总共1 + 7 + 2 + 1 + 5 = 16
次修改。所以你可以看到Levinstein距离橙色更接近 hw r u亲爱的天使; - )
答案 1 :(得分:1)
发现了一些东西......但是使用了mysql http://www.artfulsoftware.com/infotree/queries.php#552 我认为这比在php上处理它要好得多,因为我的数据存储在数据库中。
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
辅助函数:
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END;
但我不知道为什么会导致错误You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 5