Question

我正忙于编写一个简单的算法来模拟来自两个数据集的地址。我正在计算两个地址之间的levenshtein距离，然后将完全匹配或最短匹配添加到匹配的数组中。

然而，这非常慢，因为在最坏的情况下，它必须将每个旧地址与每个新地址进行比较。

我目前的解决方案如下：

matches = [];
foreach ($classifications as $classification)
{
    $classification = $stringMatchingService->standardize($classification, $stringMatchingService->isClassification());
    $shortest = -1;
    $closest = '';
    $cnt = 0;
    foreach ($lines as $line)
    {
        $line = $stringMatchingService->standardize($line, $stringMatchingService->isRTT());
        if ($classification[CLASSIFICATION_POSTCODE] != $line[RTT_POSTCODE]) {
            continue;
        }

        $lev = levenshtein($classification[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);
    if ($lev == 0) {
        $matches[$classification[CLASSIFICATION_SUBURB]] = $line[RTT_SUBURB];
        $cnt++;
        break;
    }

    if ($lev <= $shortest || $shortest < 0) {
        //set the closest match and distance
        $closest = $line[RTT_SUBURB];
        $shortest = $lev;
    }

    if ($cnt == count($lines)) {
        $matches[$classification[CLASSIFICATION_SUBURB]] = $closest;
    }
    $cnt++;
}

}
print_r(count($matches));

请注意，standardize函数只是通过删除无关信息和填充邮政编码来尝试标准化地址。

我想知道如何加快速度，因为目前它非常昂贵，或者是否有更好的方法可以采取？

感谢任何帮助，

谢谢！

修改 $ classifications的大小为 12000行，$ lines的大小为 17000行。标准化功能如下：

 public function standardize($line, $dataSet)
{
    switch ($dataSet) {
        case self::CLASSIFICATIONS:
            if (!isset($line[9], $line[10]) || empty($line[9]) || empty($line[10])) {
                continue;
            }
            $suburb = $line[9];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);

            $postCode = $line[10];
            $postCode = str_pad($postCode, 4,'0', STR_PAD_LEFT);
            $line[9] = $suburb;
            $line[10] = $postCode;
            return $line;
        case self::RTT:
            if (!isset($line[1], $line[0]) || empty($line[1]) || empty($line[0])) {
                continue;
            }
            $suburb = $line[1];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);
            $postCode = $line[0];
            $postCode = str_pad($postCode, 4,'0', STR_PAD_LEFT);
            $line[1] = $suburb;
            $line[0] = $postCode;
            return $line;
    }

它的目的只是为了适当地访问数据并删除某些关键字，并填写邮政编码，如果它不是格式XXXX。

Answer 1

此问题适用于每个$classifications行，您可以检查$line中的行是否匹配。 = 12000 * 17000 ......

所以，我不知道数组的结构，但你可以想象使用array_filter。

$matches = array_filter($classifications, function ($entry) use ($lines) {

    foreach ($lines as $line)
    {
        $lev = levenshtein($entry[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);

        // if match, return true
    }

});

$matches将是一系列匹配的行。

这取决于您的数据结构，但更好的方法是使用array_merge加上array_unique

Answer 2

你对levenshtein距离算法使用了什么宽容？根据我的经验，少于0.8会返回太多错误的匹配。我最终使用手动修正短字，例如raod = road，否则得分将是1个字符错误，使其成为75％匹配。我找到了12 tests to find addresses using fuzzy matching的文章，可能对改进算法有用。例子包括：

拼写错误
缺少空间
错误类型（街道与道路）
毗邻/附近郊区
缩写
同义词：Floor vs Level
单位，公寓或公寓与信件
号码与信件
额外词汇（例如前门，部门名称）
Swapped Letters
听起来像
令牌化（不同的输入顺序

模糊匹配地址

2 个答案: