Question

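Suppose I have an ArrayList aList1 with the following entries: Mohandas Gandhi, M Gandhi, Martin Luther King, Kim Jong, Abrahm Lincln, Kim Jng

Suppose I have another ArrayList aList2 that holds all the correctly spelled names. How do I match each item in aList1 to the corresponding item in aList2?

I want the final output to be Mohandas Gandhi, Mohandas Gandhi, Martin Luther King, Kim Jong, Abraham Lincoln, Kim Jong.

The output should contain no misspellings. How do I match the two words? If I can match two words, then I can use edit distance to convert one word into the other.

I need to write the code in Java.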

Answer 1

Something like this should get you started:

import java.util.ArrayList;
import java.util.List;

// Wrapped in a method so the fragment compiles; the name correctNames is only illustrative.
static List<String> correctNames() {
    String[] incorrectNames = "Mohandas Gandhi, M Gandhi, Martin Luther King, Kim Jong, Abrahm Lincln, Kim Jng".split(", ");
    String[] dictionary = "Mohandas Gandhi, Martin Luther King, Kim Jong, Abraham Lincoln".split(", ");

    List<String> correctedNames = new ArrayList<>();
    for (String incorrectName : incorrectNames) {
        // Track the dictionary entry with the smallest edit distance seen so far.
        int distance = Integer.MAX_VALUE;
        String closestMatch = null;
        for (String correctName : dictionary) {
            int currentDistance = levenshteinDistance(incorrectName, correctName);
            if (distance > currentDistance) {
                distance = currentDistance;
                closestMatch = correctName;
            }
        }
        correctedNames.add(closestMatch);
    }

    return correctedNames;
}

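You will of course need an implementation of levenshteinDistance. Other things to note: the algorithm makes O(m*n) distance computations, where m is the size of the dictionary and n is the number of names to correct, and Levenshtein distance may be too simplistic for this task.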
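The answer leaves levenshteinDistance undefined. A minimal sketch of one possible implementation follows, using the standard two-row dynamic-programming formulation. The class name EditDistance is an assumption made for illustration, and the comparison is case-sensitive with unit cost for insertions, deletions and substitutions.

```java
final class EditDistance {
    private EditDistance() {}

    /** Levenshtein distance between a and b (unit cost per insert, delete, substitute). */
    static int levenshteinDistance(String a, String b) {
        // previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] tmp = previous; // reuse the two rows instead of allocating a full matrix
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshteinDistance("Abrahm Lincln", "Abraham Lincoln")); // prints 2
        System.out.println(levenshteinDistance("Kim Jng", "Kim Jong"));              // prints 1
    }
}
```

Each call costs time proportional to the product of the two string lengths, which is worth keeping in mind on top of the O(m*n) number of comparisons mentioned above.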

Answer 2

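How do you approach this kind of problem?

Although there are probably many ways to do it, I think the simplest is to measure similarity, i.e. fuzzy matching. Using Levenshtein distance, you can determine what percentage of one string has to change to turn it into another string. You can then apply a filter to decide which candidates are close enough. In my case, I used Apache Commons StringUtils to get the edit distance value.

Edit: What kind of filter criteria can be set?

You have to set your own filter rules.

1. Ignore results whose percentage match is below 50%.
2. Treat the percentages as ordinary numbers and add them up to create a "total match" between the search term and the document.

In the sample code below I only used #1. That said, I didn't actually end up ignoring anything, because every data point has a match.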

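But imagine two documents, "Money Laundering" and "Money Facts", and a user who searches for "Mony Lawndaring".

Word 1       Word 2       Match
Mony         Money        80%
Mony         Laundering   10%
Lawndaring   Money        10%
Lawndaring   Laundering   80%

The total match of "Mony Lawndaring" against "Money Laundering" is 160 (80 + 80; the two 10% matches fall below the 50% cutoff from rule 1 and are not counted).

"Mony Lawndaring" vs. "Money Facts"

Word 1       Word 2       Match
Mony         Money        80%
Mony         Facts        0%
Lawndaring   Money        10%
Lawndaring   Facts        10%

The total match of "Mony Lawndaring" against "Money Facts" is 80 (again, only the 80% match clears the cutoff).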

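Result

Then simply sort the search results by highest total match, and the "Money Laundering" document will appear before "Money Facts" in the list of search results!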
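The two filter rules and the final sort can be sketched as follows. This is only an illustration under stated assumptions: the names TotalMatch, totalMatch and rank are hypothetical, words are split on whitespace, the comparison is case-sensitive, and a hand-written Levenshtein routine stands in for the Apache Commons call so that the example has no dependencies. Totals are on a 0 to 1 scale per word pair, so the 160 and 80 above correspond to roughly 1.6 and 0.8 here. The 10% figures in the tables are illustrative; with real Levenshtein similarity those pairs score differently but still stay below the 50% cutoff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TotalMatch {
    private TotalMatch() {}

    // Stand-in for StringUtils.getLevenshteinDistance (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) previous[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(current[j - 1] + 1, previous[j] + 1),
                                      previous[j - 1] + cost);
            }
            int[] tmp = previous; previous = current; current = tmp;
        }
        return previous[b.length()];
    }

    // Similarity in [0, 1], relative to the longer of the two words.
    static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) return 1.0;
        return (longerLength - levenshtein(s1, s2)) / (double) longerLength;
    }

    // Rule 1: drop word pairs below 50%. Rule 2: sum whatever remains.
    static double totalMatch(String query, String title) {
        double total = 0.0;
        for (String queryWord : query.trim().split("\\s+")) {
            for (String titleWord : title.trim().split("\\s+")) {
                double score = similarity(queryWord, titleWord);
                if (score >= 0.5) total += score;
            }
        }
        return total;
    }

    // Returns a copy of titles ordered by descending total match against the query.
    static List<String> rank(String query, List<String> titles) {
        List<String> ranked = new ArrayList<>(titles);
        ranked.sort(Comparator.comparingDouble((String t) -> totalMatch(query, t)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Money Facts", "Money Laundering");
        System.out.println(rank("Mony Lawndaring", titles)); // prints [Money Laundering, Money Facts]
    }
}
```

Summing over every word pair favours documents with more words close to the query, which is the behaviour the example relies on; whether that weighting suits a given data set is a judgment call.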

package algorithms;

import org.apache.commons.lang.StringUtils; // Apache Commons Lang 2.x
import java.util.ArrayList;
import java.util.Arrays;

public class LevenshteinDistance {

    // Returns a similarity in [0, 1]: 1.0 for identical strings,
    // 0.0 when the edit distance equals the length of the longer string.
    public static double similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) { // longer should always have greater length
            longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) { return 1.0; } // both strings are empty
        int distance = StringUtils.getLevenshteinDistance(longer, shorter);
        return (longerLength - distance) / (double) longerLength;
    }

    public static void main(String[] args) {
        ArrayList<String> aList1 = new ArrayList<String>(Arrays.asList("Mohandas Gandhi", "M Gandhi", "Martin Luther King", "Kim Jong", "Abrahm Lincln", "Kim Jng"));
        // aList2 is the dictionary, so it must hold the correctly spelled names.
        ArrayList<String> aList2 = new ArrayList<String>(Arrays.asList("Mohandas Gandhi", "Martin Luther King", "Kim Jong", "Abraham Lincoln"));
        ArrayList<String> output = new ArrayList<String>();

        for (String incorrect : aList1) {
            for (String correct : aList2) {
                if (similarity(incorrect, correct) > 0.5) {
                    System.out.println("Match found with similarity more than 50% between: " + incorrect + " from first list and " + correct + " from second list");
                    output.add(correct);
                    break; // stop at the first match so each input name yields at most one result
                }
            }
        }

        System.out.println("Output: ");

        for (String out : output) {
            System.out.println(out);
        }
    }
}

Result:

Match found with similarity more than 50% between: Mohandas Gandhi from first list and Mohandas Gandhi from second list

Match found with similarity more than 50% between: M Gandhi from first list and Mohandas Gandhi from second list

Match found with similarity more than 50% between: Martin Luther King from first list and Martin Luther King from second list

Match found with similarity more than 50% between: Kim Jong from first list and Kim Jong from second list

Match found with similarity more than 50% between: Abrahm Lincln from first list and Abraham Lincoln from second list

Match found with similarity more than 50% between: Kim Jng from first list and Kim Jong from second list

Final output array:

Mohandas Gandhi
Mohandas Gandhi
Martin Luther King
Kim Jong
Abraham Lincoln
Kim Jong
