Question

I wrote the following Python code to remove duplicates:

lines_seen = set()
outfile = open("out.txt", "w")
for line in open("file.txt", "r"):
    if line not in lines_seen: 
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

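The code above works and removes exact duplicates, but I also want to remove any line that shares three or more exact word matches with another line. For example: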

The apple is red
The apple red
The banana is yellow
The apple is red

The output of the current code is:

The apple is red
The apple red
The banana is yellow

But I want the phrase 'The apple red' removed as well, because three of its words match another line. I hope that makes sense. How can I write this in Python?

Answer 1

A very simple way to do what you want is to keep a list of the word sets seen so far and compare each new line against every one of them:

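lines_seen = []  # word sets of the lines kept so far
with open("file.txt", "r") as infile, open("out.txt", "w") as outfile:
    for line in infile:
        words = set(line.split())
        for word_set in lines_seen:
            # Three or more shared words: treat the line as a duplicate.
            if len(words.intersection(word_set)) >= 3:
                break
        else:
            # The loop finished without a break, so no earlier line matched.
            outfile.write(line)
            lines_seen.append(words)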

This yields:

The apple is red
The banana is yellow

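Of course, this ignores some of the subtleties mentioned in the comments on your question. You may be better off using a specialized library such as difflib.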
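As a rough illustration of the difflib suggestion, here is a minimal sketch that compares lines word by word with SequenceMatcher. The helper names is_similar and dedupe and the 0.8 threshold are assumptions made for this example, not part of the original answer; the threshold in particular would need tuning for real data.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """Return True when the word sequences of a and b are close enough.

    The 0.8 threshold is an arbitrary choice for this sketch.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio() >= threshold


def dedupe(lines, threshold=0.8):
    """Keep each line only if it is not similar to a line already kept."""
    kept = []
    for line in lines:
        if not any(is_similar(line, seen, threshold) for seen in kept):
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "The apple is red",
        "The apple red",
        "The banana is yellow",
        "The apple is red",
    ]
    for line in dedupe(sample):
        print(line)
```

Unlike the fixed count of three shared words, a ratio scales with line length, so long lines that happen to share a few common words are less likely to be dropped.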

Answer 2

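Take a look at string distance functions:

Hamming distance
Levenshtein distance
Jaro-Winkler distance

There is also a Python package for fuzzy string matching; I believe it implements method 2 (Levenshtein distance). These will not do the word matching you describe, but string distance may be a more robust way to achieve your goal.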
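To make the idea of a string distance concrete, here is a small pure-Python sketch of Levenshtein distance. It is a textbook dynamic-programming version written for illustration, not the implementation used by any particular package, and the function name levenshtein is chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b.

    Counts the fewest insertions, deletions, and substitutions needed
    to turn a into b. Works on strings or on lists of words.
    """
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Character level: removing "is " costs three edits.
    print(levenshtein("The apple is red", "The apple red"))
    # Word level: removing the single word "is" costs one edit.
    print(levenshtein("The apple is red".split(), "The apple red".split()))
```

Passing lists of words instead of raw strings gives a word-level distance, which is closer to the word matching asked about in the question.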

How do I remove similar duplicates from a text file using Python?
