Question

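I have scraped txt files from different websites, and now I need to combine them into one file. Many lines from the various sites are similar to each other. I want to remove the duplicates. Here is what I have tried: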

import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:

    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()

    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True

    if not similar:
        destfile.write(sourceline)

    destfile.close()

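I run this once for each source and write line by line to the same file. The result is that even if I run it multiple times on the same source file, its lines are always appended to the destination file.

EDIT: I have tried the code from the answer. It is still slow. Even if I minimize IO, I still need O(n^2) comparisons, which hurts once you have 1000+ lines. I have about 10,000 lines per file on average.

Is there any other way to remove the duplicates?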

Answer 1

Here is a short version that does minimal IO and cleans up after itself.

import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

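# 'a+' keeps the existing content and appends new writes; 'w+' would
# truncate the file on open, so nothing would be left to read back.
with open('%s.txt' % destname, 'a+') as destfile:

  # we read in the file so that on subsequent runs of this script, we
  # won't duplicate the lines. In append mode the position can start at
  # the end of the file, so rewind before reading.
  destfile.seek(0)
  known_lines = set(destfile.readlines())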

  with open('%s.txt' % sourcename) as sourcefile:
    for line in sourcefile:
      similar = False
      for known in known_lines:
        ratio = difflib.SequenceMatcher(None, line, known).ratio()
        if ratio > 0.8:
          # single-argument print(...) behaves the same on Python 2 and 3
          print(ratio)
          print(line)
          print(known)
          similar = True
          break
      if not similar:
        destfile.write(line)
        known_lines.add(line)

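Rather than reading the known lines from the file every time, we keep them in a set, which we use for the comparison. The set is essentially a mirror of the contents of 'destfile'.

A note on complexity

By its very nature, this problem has O(n^2) complexity. Because you are looking for similarity to known strings, rather than for identical strings, you have to look at every previously seen string. If you only wanted to remove exact duplicates rather than fuzzy matches, you could use a simple lookup in the set, which is O(1), making the whole solution O(n).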
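For the exact-duplicate case, a minimal sketch could look like the following. The function name `merge_unique_lines` and the sample lists are made up for illustration; in practice each inner list would come from reading one of the scraped files.

```python
def merge_unique_lines(sources):
    """Return lines from all sources in first-seen order, dropping exact repeats."""
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            key = line.rstrip('\n')   # ignore the trailing newline when comparing
            if key not in seen:       # average O(1) set lookup
                seen.add(key)
                merged.append(key)
    return merged

# Hypothetical sample data standing in for two scraped files.
site_a = ['this is data\n', 'and more data\n']
site_b = ['this is data\n', 'a website line\n']
result = merge_unique_lines([site_a, site_b])
```

Each line is looked up once, so the total work grows linearly with the number of lines, but lines that differ by even one character are both kept.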

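There may be a way to reduce the fundamental complexity by applying a lossy compression to the strings, so that two similar strings compress to the same result. That is, however, beyond the scope of a Stack Overflow answer and beyond my expertise. It is an active research area, so you may have some luck digging through the literature.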
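To make the idea concrete, here is a very crude sketch of such a lossy key. It is only an assumption about what "compress to the same result" could mean: `fingerprint` is a hypothetical helper that lowercases the line and throws away punctuation and extra whitespace, so it only merges lines that differ in those respects and will not catch lines that differ by a word or a typo.

```python
import re

def fingerprint(line):
    """Hypothetical lossy key: lowercase, drop punctuation, collapse whitespace."""
    words = re.findall(r'\w+', line.lower())
    return ' '.join(words)

def dedupe_by_fingerprint(lines):
    """Keep the first line seen for each fingerprint."""
    seen = set()
    kept = []
    for line in lines:
        key = fingerprint(line)
        if key not in seen:   # set lookup instead of comparing against every line
            seen.add(key)
            kept.append(line)
    return kept

# Hypothetical sample lines.
lines = ['This is data.', 'this  is   data', 'and more data']
kept = dedupe_by_fingerprint(lines)
```

Because each line is reduced to a key and looked up in a set, this runs in roughly linear time, at the cost of a much narrower notion of "similar" than `SequenceMatcher` gives you.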

You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
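One way to use them, sketched below, relies on the fact that the difflib documentation describes both quick_ratio() and real_quick_ratio() as upper bounds on ratio(). If a cheap bound is already at or below the threshold, the expensive call can be skipped. The helper name `is_similar` and the default threshold are my own choices for illustration.

```python
from difflib import SequenceMatcher

def is_similar(line, known, threshold=0.8):
    """Return True if ratio() exceeds the threshold, trying cheap bounds first."""
    matcher = SequenceMatcher(None, line, known)
    # Both quick checks are upper bounds on ratio(), so failing either one
    # means ratio() cannot exceed the threshold; `and` short-circuits.
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)

same = is_similar('this is data\n', 'this is data\n')
different = is_similar('radically different thing\n', 'a website line\n')
```

This does not change the O(n^2) number of comparisons, it only makes most of them cheaper, and the final answer is the same as testing ratio() alone.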

Answer 2

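Your code works for me. When lines are similar (in the example I used, exactly the same), it prints destline and sourceline to stdout, but it only writes each unique line to the file once. You may need to set the ratio threshold lower for your particular "similarity" needs.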

Answer 3

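Basically, what you need to do is check every line in the source file to see whether it has a match against any line of the destination file.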

##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data

##bindresult.txt
##--------------
##a website line
##this is data
##and more data

from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()

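destfile = open('bindresult.txt', 'a+')
destfile.seek(0)  # 'a+' can start positioned at the end; rewind before reading
destlines = destfile.readlines()


has_matches = {k: False for k in sourcelines}

# Check each source line against every destination line. Iterating the
# source lines in the outer loop means the break only ends the search for
# that one source line, so no other source line is skipped.
for s_line in sourcelines:

    for d_line in destlines:

        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

# Append only the source lines that matched nothing in the destination.
for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)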

destfile.close()

This will add the line "radically different thing" to the destination file.

Python: removing similar strings from multiple files
