Question

I have two lists. The first contains entries like

RB Leipzig vs SV Darmstadt 98
Hertha Berlin vs Hoffenheim
..

The second contains basically the same entries, but they may be written in a different form. For example:

Hertha BSC vs TSG Hoffenheim
RB Leipzig vs Darmstadt 98
..

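and so on. Both lists represent the same sports games, but they may use alternative team names and do not appear in the same order.

My goal (hehe, pun) is to unify both lists into one, matching the entries that are the same and discarding entries that do not appear in both lists.

I have already tried Levenshtein distance and fuzzy search. I thought about using machine learning but have no idea how to start with that.

I would appreciate any help and ideas!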

Answer 1

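You can solve this problem using linear programming combined with the Levenshtein distance you already mentioned. Linear programming is a commonly used technique for optimization problems like this one. Check this link for an example of how to use Solver Foundation in C#. The example is not related to your specific problem, but it is a good illustration of how the library works.

Hints: you need to build a distance matrix between every pair of teams/strings across the two lists. Say both lists have N elements. Row i of the matrix has N values, and the j-th value is the Levenshtein distance between the i-th element of the first list and the j-th element of the second list. Then you need to set the constraints. The constraints are: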

The sum of each row must equal 1
The sum of each column must equal 1
Each coefficient (matrix entry) must be either 0 or 1
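The distance matrix described above can be sketched as follows. This is only a minimal illustration in Python, not the Solver Foundation code the answer refers to; the function names `levenshtein` and `distance_matrix` and the sample entries are my own assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def distance_matrix(first: list[str], second: list[str]) -> list[list[int]]:
    """dist[i][j] is the distance between first[i] and second[j]."""
    return [[levenshtein(x, y) for y in second] for x in first]


# Sample entries in the spirit of the question (illustrative only).
first = ["RB Leipzig vs SV Darmstadt 98", "Hertha Berlin vs Hoffenheim"]
second = ["Hertha BSC vs TSG Hoffenheim", "RB Leipzig vs Darmstadt 98"]
dist = distance_matrix(first, second)
```

Here `dist[0][1]` is small because the two Leipzig entries differ only by the "SV " prefix, while `dist[0][0]` is much larger.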

I solved the same problem a few months ago, and this approach worked very well for me.

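The cost function is the sum `sum(coef[i][j] * dist[i][j], for i in [1, n] and j in [1, n])`. You want to minimize this function because you want the overall "distance" between the two sets after the mapping to be as low as possible.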
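To make the optimization concrete, here is a small sketch that minimizes that cost function. Note that it swaps in a brute-force search over permutations instead of an LP solver: a permutation picks exactly one column per row and one row per column, which is what the three constraints describe. The name `best_assignment` and the distance values are made up for illustration.

```python
from itertools import permutations


def best_assignment(dist):
    """Return (total_cost, mapping) minimizing sum(dist[i][mapping[i]]).

    mapping[i] = j corresponds to coef[i][j] = 1; every other coef is 0.
    """
    n = len(dist)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        cost = sum(dist[i][perm[i]] for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


# Made-up distances for three entries per list (illustration only).
dist = [
    [25, 3, 20],
    [6, 24, 22],
    [21, 23, 2],
]
cost, mapping = best_assignment(dist)
# mapping[i] is the index in the second list matched to entry i of the first.
```

Brute force grows as N!, so it is only reasonable for a handful of entries. For realistic list sizes you would hand the same matrix to an LP solver or to an assignment-problem routine such as SciPy's `linear_sum_assignment`.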

Answer 2

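You can use a BK-tree (I Googled for C# implementations and found two: 1, 2). Use Levenshtein distance as the metric. Optionally, remove all-uppercase substrings from the names in the lists to improve the metric (take care that this does not accidentally leave you with an empty string for a name).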
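For readers who have not seen a BK-tree, a minimal sketch in Python might look like the following. It is not one of the C# implementations mentioned above; the class name `BKTree`, its `add` and `search` methods, and the sample team names are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance used as the BK-tree metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), children keyed by distance to word."""

    def __init__(self, metric):
        self.metric = metric
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.metric(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, stored_word) pairs within max_dist."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.metric(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # Triangle inequality: matches can only sit under edges
            # whose label lies in [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for name in ["RB Leipzig", "SV Darmstadt 98", "Hertha Berlin", "Hoffenheim"]:
    tree.add(name)
hits = tree.search("Darmstadt 98", 3)
```

The `max_dist` threshold is a tuning knob: too small and alternative spellings are missed, too large and unrelated teams start to match.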

1. Put the names from the first list in the BK-tree
2. Look up the names from the second list in the BK-tree
  a. Assign an integer token to the name pair, stored in a Map<Integer, Tuple<String, String>>
  b. Replace each team name with the token
3. Sort each token pair (so [8 vs 4] becomes [4 vs 8])
4. Sort each list by its first token in the token pair, 
   then by the second token in the token pair (so the list
   would look like [[1 vs 2], [1 vs 4], [2 vs 4]])
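Steps 2b through 4 above could be sketched like this in Python. The `matched` dictionary stands in for the result of the BK-tree lookups (it plays the role of the Map<Integer, Tuple<String, String>>), and its contents, like the helper name `tokenize`, are hypothetical.

```python
# Hypothetical outcome of the lookups: token -> (name in list 1, name in list 2).
matched = {
    1: ("RB Leipzig", "RB Leipzig"),
    2: ("SV Darmstadt 98", "Darmstadt 98"),
    3: ("Hertha Berlin", "Hertha BSC"),
    4: ("Hoffenheim", "TSG Hoffenheim"),
}
# Reverse view: every known spelling maps to its token.
token_of = {name: token for token, names in matched.items() for name in names}


def tokenize(fixtures, token_of):
    """Replace names with tokens, order each pair, then sort the list."""
    pairs = []
    for fixture in fixtures:
        home, away = fixture.split(" vs ")
        if home not in token_of or away not in token_of:
            continue  # a team without a token cannot match anything
        # Step 3: order the pair so [8 vs 4] becomes [4 vs 8].
        pairs.append(tuple(sorted((token_of[home], token_of[away]))))
    # Step 4: tuples sort by first token, then by second token.
    return sorted(pairs)


list1 = tokenize(["RB Leipzig vs SV Darmstadt 98",
                  "Hertha Berlin vs Hoffenheim"], token_of)
list2 = tokenize(["Hertha BSC vs TSG Hoffenheim",
                  "RB Leipzig vs Darmstadt 98"], token_of)
```

After this step both lists contain the same sorted token pairs, regardless of the original spelling or order.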

Now you just iterate through both lists:

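int i1 = 0;
int i2 = 0;
// Both lists are sorted by (first, second), so walk them in lockstep.
while (i1 < list1.length && i2 < list2.length) {
  if (list1[i1].first == list2[i2].first && list1[i1].second == list2[i2].second) {
    // match: the same fixture appears in both lists
    i1++;
    i2++;
  } else if (list1[i1].first < list2[i2].first) {
    i1++;  // list1 entry has no counterpart in list2
  } else if (list1[i1].first > list2[i2].first) {
    i2++;  // list2 entry has no counterpart in list1
  } else if (list1[i1].second < list2[i2].second) {
    i1++;
  } else {
    i2++;
  }
}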

Finding matching entries that are written differently in two lists
