Python sequence matcher with a custom matching function

Question

I have two lists, and I want to find the matching elements using Python's difflib SequenceMatcher, like this:

from difflib import SequenceMatcher
def match_seq(list1,list2):
    output=[]
    s = SequenceMatcher(None, list1, list2)
    blocks=s.get_matching_blocks()
    for bl in blocks:
        #print(bl, bl.a, bl.b, bl.size)
        for bi in range(bl.size):
            cur_a=bl.a+bi
            cur_b=bl.b+bi
            output.append((cur_a,cur_b))
    return output

So when I run it on two lists like these:

list1=["orange","apple","lemons","grapes"]
list2=["pears", "orange","apple", "lemons", "cherry", "grapes"]
for a,b in match_seq(list1,list2):
    print(a,b, list1[a],list2[b])

I get this output:

0 1 orange orange
1 2 apple apple
2 3 lemons lemons
3 5 grapes grapes

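But suppose I don't want to match only identical items, and instead want to use a matching function (for example, a function that matches orange to oranges and vice versa, or one that matches the equivalent word in another language).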

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

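Is there an option in difflib's SequenceMatcher, or in any other built-in Python library, that provides this, so that I can match list3 with list4 and also list3 with list5, the same way I did for list1 and list2?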

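More generally, can you think of a solution? I thought of replacing each word in the target list with the equivalents I want it to match, but that can be problematic: each word may need several equivalents, which would disturb the sequence.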

Answer 1

You basically have three options: 1) write your own diff implementation; 2) hack the difflib module; 3) find a workaround.

Your own implementation

In case 1), you can look at this question and read a few books, such as CLRS or Robert Sedgewick's books.
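To give an idea of what option 1) can look like, here is a minimal sketch of a longest-common-subsequence matcher that compares elements with a user-supplied predicate instead of ==. The names match_seq_custom and same_stem are hypothetical and not part of difflib, and the predicate is a toy. Note that SequenceMatcher does not compute a longest common subsequence, so on other inputs the two approaches can return different alignments.

```python
def match_seq_custom(list1, list2, is_match):
    """Return index pairs (i, j) of a longest common subsequence,
    comparing elements with is_match(x, y) instead of ==."""
    n, m = len(list1), len(list2)
    # lcs[i][j] = length of the best alignment of list1[i:] and list2[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if is_match(list1[i], list2[j]):
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the top-left corner to recover the pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if is_match(list1[i], list2[j]):
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def same_stem(x, y):
    # Toy predicate (an assumption for this example): ignore trailing "s".
    return x.rstrip("s") == y.rstrip("s")


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
pairs = match_seq_custom(list3, list4, same_stem)
print(pairs)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```

This runs in O(len(list1) * len(list2)) time and memory, which is fine for short lists but can become costly for long ones.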

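Hack the `difflib` module

In case 2), have a look at the source code: get_matching_blocks calls find_longest_match at line 479. At the core of find_longest_match is the b2j dictionary, which maps each element of list b to the list of indices where it occurs in b; the algorithm looks up each element of list a in it. If you overwrite this dictionary, you can achieve what you want. Here is the standard version: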

>>> import difflib
>>> from difflib import SequenceMatcher
>>> list3 = ["orange","apple","lemons","grape"]
>>> list4 = ["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> s = SequenceMatcher(None, list3, list4)
>>> s.get_matching_blocks()
[Match(a=1, b=2, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(1, 2, 'apple', 'apple')]

Here is the hacked version:

>>> s = SequenceMatcher(None, list3, list4)
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5]}
>>> s.b2j = {**s.b2j, 'orange':s.b2j['oranges'], 'lemons':s.b2j['lemon'], 'grape':s.b2j['grapes']}
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5], 'orange': [1], 'lemons': [3], 'grape': [5]}
>>> s.get_matching_blocks()
[Match(a=0, b=1, size=3), Match(a=3, b=5, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'), (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]

This is not hard to automate, but I do not recommend this solution, since there is a very simple workaround.
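For illustration only, the automation could be sketched as follows. The helper patched_matcher and the predicate same_stem are hypothetical names. The sketch relies on b2j, an implementation detail of SequenceMatcher, so it may break in other Python versions; it also assumes the index lists must stay in increasing order, as they are in the original dictionary.

```python
from difflib import SequenceMatcher


def patched_matcher(a, b, is_match):
    """Build a SequenceMatcher whose b2j also maps each element of a
    to the indices of its equivalents in b (hypothetical helper)."""
    s = SequenceMatcher(None, a, b, autojunk=False)
    extra = {}
    for x in set(a):
        # enumerate() keeps the indices in increasing order.
        indices = [j for j, y in enumerate(b) if x == y or is_match(x, y)]
        if indices:
            extra[x] = indices
    # Patch before get_matching_blocks() is first called (its result is cached).
    s.b2j = {**s.b2j, **extra}
    return s


def same_stem(x, y):
    # Toy predicate (an assumption for this example): ignore trailing "s".
    return x.rstrip("s") == y.rstrip("s")


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
s = patched_matcher(list3, list4, same_stem)
pairs = [(m.a + i, m.b + i) for m in s.get_matching_blocks() for i in range(m.size)]
print(pairs)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```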

The workaround

The idea is to group words by family:

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes"}]

Now it is easy to create a dictionary that maps each word of a family to one of those words (let's call it the main word):

>>> d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
>>> d
{'pears': 'peras', 'orange': 'naranjas', 'oranges': 'naranjas', 'manzana': 'apple', 'lemon': 'lemons', 'limón': 'lemons', 'cherry': 'cereza', 'grape': 'grapes'}

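Note that main, *alternatives in map(list, families) uses the star operator to unpack each family into a main word (the first element of the list) and a list of alternatives. Because the families are sets, which word comes first is arbitrary and can change between runs; that is harmless as long as every list is converted with the same dictionary.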

>>> head, *tail = [1,2,3,4,5]
>>> head
1
>>> tail
[2, 3, 4, 5]

Then you can convert the lists to use only main words:

>>> list3=["orange","apple","lemons","grape"]
>>> list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
>>> [d.get(w, w) for w in list3]
['naranjas', 'apple', 'lemons', 'grapes']
>>> [d.get(w, w) for w in list4]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'grapes']
>>> [d.get(w, w) for w in list5]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'uvas']

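The expression d.get(w, w) returns d[w] if w is a key, and w itself otherwise. Hence, words that belong to a family are converted to the main word of that family, and all other words are left untouched.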
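Since the families above are sets, which word becomes the main word depends on set iteration order. If reproducible output matters, one possible variant (a sketch, not part of the original approach) is to write each family as a tuple whose first element is the main word. The helper name build_main_word_map is made up for this example.

```python
def build_main_word_map(families):
    """Map every alternative to the first word of its family.
    Each family is an ordered sequence (tuple or list), not a set.
    Raises if an alternative is assigned to two different main words."""
    d = {}
    for main, *alternatives in families:
        for w in alternatives:
            if w in d and d[w] != main:
                raise ValueError(f"{w!r} belongs to more than one family")
            d[w] = main
    return d


ordered_families = [
    ("pears", "peras"),
    ("orange", "oranges", "naranjas"),
    ("apple", "manzana"),
    ("lemons", "lemon", "limón"),
    ("cherry", "cereza"),
    ("grape", "grapes"),
]
d = build_main_word_map(ordered_families)

list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
converted4 = [d.get(w, w) for w in list4]
print(converted4)  # ['pears', 'orange', 'apple', 'lemons', 'cherry', 'grape']
```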

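The converted lists are then easy to compare with difflib.

Important: the time complexity of converting the lists is negligible compared to that of the diff algorithm, so you should not notice any difference.

Full code

As a bonus, here is the full code: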

from difflib import SequenceMatcher

def match_seq(list1, list2):
    """A generator that yields matches of list1 vs list2"""
    s = SequenceMatcher(None, list1, list2)
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i # you don't need to store the matches, just yield them

def create_convert(*families):
    """Return a converter function that converts a list
    to the same list with only main words"""
    d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
    return lambda L: [d.get(w, w) for w in L]

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
convert = create_convert(*families)

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

print("list3 vs list4")
for a,b in match_seq(convert(list3), convert(list4)):
    print(a,b, list3[a],list4[b])

# list3 vs list4
# 0 1 orange oranges
# 1 2 apple apple
# 2 3 lemons lemon
# 3 5 grape grapes

print("list3 vs list5")
for a,b in match_seq(convert(list3), convert(list5)):
    print(a,b, list3[a],list5[b])

# list3 vs list5
# 0 1 orange naranjas
# 1 2 apple manzana
# 2 3 lemons limón
# 3 5 grape uvas
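When equivalence can be computed rather than enumerated, the same conversion trick works with a key function instead of a table of families. The following is a sketch under that assumption; match_seq_by_key and naive_singular are hypothetical names, and the normalizer is deliberately naive.

```python
from difflib import SequenceMatcher


def match_seq_by_key(list1, list2, key):
    """Yield (i, j) index pairs of elements whose keys are equal,
    aligned by SequenceMatcher on the converted lists."""
    s = SequenceMatcher(None, [key(x) for x in list1], [key(y) for y in list2])
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i


def naive_singular(word):
    # Hypothetical normalizer: lowercase and drop one trailing "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") else word


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
pairs = list(match_seq_by_key(list3, list4, naive_singular))
print(pairs)  # [(0, 1), (1, 2), (2, 3), (3, 5)]
```

This only covers relations that reduce to comparing a single key per word; translations between languages still need an explicit table such as families.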

Answer 2

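So let's say you fill lists with the elements that should match each other. I did not use any library, only generators. I am not sure about its efficiency, and I have only tried this code once, but I think it should work well.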

orange_list = ["orange", "oranges"] # Fill this with orange matching words
pear_list = ["pear", "pears"]
lemon_list = ["lemon", "lemons"]
apple_list = ["apple", "apples"]
grape_list = ["grape", "grapes"]

lists = [orange_list, pear_list, lemon_list, apple_list, grape_list] # Put your matching lists inside this list

def match_seq_bol(list1, list2):
    output = []
    for i, x in enumerate(list1):
        for lst in lists:
            # indices of the elements of list2 that share a matching list with x
            matches = (j for j, y in enumerate(list2) if x in lst and y in lst)
            for j in matches:
                output.append((i, j, x, list2[j]))
    return output

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]

print(match_seq_bol(list3, list4))

match_seq_bol() stands for "match sequences based on lists".

The output of matching list3 and list4 will be:

[
    (0, 1, 'orange', 'oranges'),
    (1, 2, 'apple', 'apple'),
    (2, 3, 'lemons', 'lemon'),
    (3, 5, 'grape', 'grapes')
]
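On the efficiency question, one possible speed-up is to replace the nested scans with dictionary lookups. This is a sketch, not a benchmarked implementation: match_by_family is a hypothetical name, and it assumes each word belongs to at most one matching list.

```python
from collections import defaultdict


def match_by_family(list1, list2, families):
    """Pair every element of list1 with every element of list2
    that belongs to the same family, using dictionary lookups."""
    # Assumption: a word appears in at most one family.
    family_of = {word: k for k, family in enumerate(families) for word in family}
    # Positions in list2, grouped by family index.
    positions = defaultdict(list)
    for j, y in enumerate(list2):
        if y in family_of:
            positions[family_of[y]].append(j)
    output = []
    for i, x in enumerate(list1):
        if x in family_of:
            for j in positions.get(family_of[x], []):
                output.append((i, j, x, list2[j]))
    return output


lists = [
    ["orange", "oranges"],
    ["pear", "pears"],
    ["lemon", "lemons"],
    ["apple", "apples"],
    ["grape", "grapes"],
]
list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
result = match_by_family(list3, list4, lists)
print(result)
# [(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'),
#  (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]
```

Like match_seq_bol(), this pairs elements regardless of their order, so unlike SequenceMatcher it can return crossing matches: match_by_family(["apple", "orange"], ["oranges", "apple"], lists) yields both (0, 1, ...) and (1, 0, ...).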
