Question

我正在尝试在一组字段中找到接近重复的值，以便管理员清理它们。

我在

上有两个匹配的标准

一根弦完全包含在另一根中，并且至少是其长度的1/4
字符串的编辑距离小于两个字符串总长度的5％

Pseudo-PHP代码：

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

我尽可能减少对相对昂贵的stripos和levenshtein函数的调用，这大大缩短了执行时间。但是，作为O（n ^ 2）操作，这不会扩展到更大的值集，并且似乎花费了大量的处理时间来简单地遍历数组。

在

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

我还能做些什么来减少检查标准的时间，更重要的是我有什么方法可以减少所需的标准检查次数（例如，通过预处理输入值），因为选择性如此之低？

编辑：已实施的解决方案

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

Answer 1

您可以先按长度（O（N））对字符串进行排序，然后只检查较小的字符串为子字符串或较大的字符串，另外只检查字符串对中的levenshtein，其差异不会太大。

您已经执行了这些检查，但是现在您对所有N x N对进行检查，而首先按长度进行预选将帮助您减少要检查的对。避免使用N x N循环，即使它只包含失败的测试。

对于子字符串匹配，您可以通过为所有较小的项创建索引来进一步改进，并在解析较大的项时相应地更新它。索引应该可以在字母上形成分支的树结构，其中每个单词（字符串）形成从根到叶的路径。通过这种方式，您可以查找索引中的任何单词是否与要匹配的某些字符串进行比较。对于匹配字符串中的每个字符，尝试继续树索引中的任何指针，并在索引处创建新指针。如果指针无法继续处理索引中的后续字符，则将其删除。如果任何指针到达叶子音符，则表示您已找到子字符串匹配。我认为，实现这一点并不困难，但也不是微不足道的。

Answer 2

通过收紧内循环，您可以立即获得100％的提升。你的结果中没有重复匹配吗？

对于预处理步骤，我会经历并计算字符频率（假设你的字符集很小，就像a-z0-9，考虑到你正在使用stripo，我认为很可能）。然后而不是比较序列（昂贵）比较频率（便宜）。这会给你误报，你可以忍受，或插入你目前要淘汰的测试。

优化近似重复值搜索

2 个答案: