Question

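I have two strings and I want to find their union while preserving word order. The background: I ran several OCR methods on the same image and got different results, and I want to merge all of those results into a single result that contains the most content.

This is roughly what I'm after: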

#example1
string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez" 

#example2
string2 = "This is a test trees are green roses are red"
string1 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez"

#example3
string1 = "telephone conversation in some place big image on screen"
string2 = "roses are red telephone conversation in some place big image on screen"
finalstring = "roses are red telephone conversation in some place big image on screen"
#or the following - both are fine in this scenario.
finalstring = "telephone conversation in some place big image on screen roses are red "

Here is what I tried:

>>> string1 = "This is a test trees are green roses are red"
>>> string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
>>> list1 = string1.split(" ")
>>> list2 = string2.split(" ")
>>> " ".join(list(set(list1) | set(list2))).lower()
'a gonzalez this is trees anthony roses green are test 12.48.1952 test is red'

Answer 1

You can use difflib.SequenceMatcher:

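import difflib

def merge(l, r):
    # Compare the two sequences and walk the edit operations in order.
    m = difflib.SequenceMatcher(None, l, r)
    for o, i1, i2, j1, j2 in m.get_opcodes():
        if o == 'equal':
            # Present in both sequences: emit it once.
            yield l[i1:i2]
        elif o == 'delete':
            # Present only in the left sequence.
            yield l[i1:i2]
        elif o == 'insert':
            # Present only in the right sequence.
            yield r[j1:j2]
        elif o == 'replace':
            # The sequences differ here: keep both parts, left first.
            yield l[i1:i2]
            yield r[j1:j2]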

Use it like this:

>>> string1 = 'This is a test trees are green roses are red'
>>> string2 = 'This iS a TEST trees 12.48.1952 anthony gonzalez'

>>> merged = merge(string1.lower().split(), string2.lower().split())
>>> ' '.join(' '.join(x) for x in merged)
'this is a test trees are green roses are red 12.48.1952 anthony gonzalez'

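If you want to perform the merge at the character level instead, just change the call so that it operates on the strings directly (rather than on lists of words):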

>>> merged = merge(string1.lower(), string2.lower())
>>> ''.join(merged)
'this is a test trees 12.48.1952 arenthony gronzaleen roses are redz'

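This solution correctly preserves the order of the individual parts of each string. So if both strings end with a common part but have different segments before that ending, both segments will still appear before the common ending in the result. For example, merging A B D and A C D gives you A B C D.

As a consequence, you can recover each original string, in the right order, simply by removing parts of the result. If you remove C from that example result, you get the first string back; if you remove B, you get the second string. This also holds for more complex merges.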
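
To make that property concrete, here is a minimal sketch that merges A B D with A C D and checks that both inputs can be recovered from the result by deletion alone. The helper `is_subsequence` is a hypothetical name introduced only for this illustration, and `merge` is restated in condensed form so the snippet runs on its own.

```python
import difflib

def merge(l, r):
    # Same logic as the generator above, repeated so the snippet is self-contained.
    m = difflib.SequenceMatcher(None, l, r)
    for o, i1, i2, j1, j2 in m.get_opcodes():
        if o in ('equal', 'delete'):
            yield l[i1:i2]
        elif o == 'insert':
            yield r[j1:j2]
        elif o == 'replace':
            yield l[i1:i2]
            yield r[j1:j2]

def is_subsequence(part, whole):
    # Hypothetical helper: True if `part` can be obtained from `whole`
    # by deleting items without reordering the remaining ones.
    it = iter(whole)
    return all(item in it for item in part)

left = ['A', 'B', 'D']
right = ['A', 'C', 'D']

# Flatten the chunks yielded by merge() into a single list of items.
result = [item for chunk in merge(left, right) for item in chunk]

print(result)
print(is_subsequence(left, result), is_subsequence(right, result))
```

Both checks should report that the inputs are subsequences of the merged list, which is the ordering guarantee described above.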

Answer 2

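Don't use a set for this. As you have probably noticed, only one copy of each word makes it into the final result, because set() keeps only unique objects.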

string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"

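# Start from the words of the first string.
str_lst = string1.split()

# Compare the words position by position and append every
# word of string2 that differs from its counterpart in string1.
for s, t in zip(string1.split(), string2.split()):
    if s.lower() != t.lower():
        str_lst.append(t)

string = " ".join(s.lower() for s in str_lst)
# this is a test trees are green roses are red 12.48.1952 anthony gonzalez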
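
One caveat with the loop above: zip() stops at the shorter input, so when the second string has more words than the first (as in example 2 of the question), its trailing words are silently dropped. A possible variation, offered only as a sketch, swaps in itertools.zip_longest to keep them. The function name `positional_merge` is made up for this illustration. The comparison is still purely positional, so a shifted match such as example 3 is not handled.

```python
from itertools import zip_longest

def positional_merge(first, second):
    # Hypothetical helper: keep all words of `first`, then append the words of
    # `second` that differ from the word at the same position in `first`.
    first_words = first.lower().split()
    second_words = second.lower().split()
    merged = list(first_words)
    # zip_longest pads the shorter list with None instead of truncating.
    for s, t in zip_longest(first_words, second_words):
        if t is not None and s != t:
            merged.append(t)
    return " ".join(merged)

print(positional_merge(
    "This iS a TEST trees 12.48.1952 anthony gonzalez",
    "This is a test trees are green roses are red",
))
```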

Answer 3

" ".join(x if i >= len(string2.split()) or x == string2.lower().split()[i] else " ".join((x, string2.split()[i])) for i, x in enumerate(string1.lower().split()))

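You can accomplish what you want with a generator expression and join like this. It sets i to the index of each word in string1 and x to the word itself. It then checks whether x matches the word at the same index in string2; if it does not, the string2 word at index i is added after x, so both words end up in the final string.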
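
For readability, the one-liner can be unrolled into a loop. This is only a sketch of the same idea: the function name `interleave_merge` is made up here, and unlike the one-liner it lowercases the words of both strings. Note that differing words are interleaved rather than appended at the end, and that words of the second string beyond the length of the first are not included.

```python
def interleave_merge(first, second):
    # Hypothetical name; a loop version of the one-liner above.
    first_words = first.lower().split()
    second_words = second.lower().split()
    out = []
    for i, x in enumerate(first_words):
        out.append(x)
        # Where the second string has a different word at the same index,
        # place that word right after x.
        if i < len(second_words) and second_words[i] != x:
            out.append(second_words[i])
    return " ".join(out)

print(interleave_merge(
    "This is a test trees are green roses are red",
    "This iS a TEST trees 12.48.1952 anthony gonzalez",
))
```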

How to find the union of two strings and maintain order

3 answers: