Question

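I want to compare two text files. The first text file has lines that are not in the second text file. I want to copy those lines and write them to a new txt file. I'd like a Python script for this, since I do it often and don't want to keep going online to find these new lines. I don't need to check whether file2 contains anything that isn't in file1.

I wrote some code that seems to work inconsistently. I'm not sure what I'm doing wrong.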

newLines = open("file1.txt", "r")
originalLines = open("file2.txt", "r")
output = open("output.txt", "w")

lines1 = newLines.readlines()
lines2 = originalLines.readlines()
newLines.close()
originalLines.close()

duplicate = False
for line in lines1:
    if line.isspace():
        continue
    for line2 in lines2:
        if line == line2:
            duplicate = True
            break

    if duplicate == False:
        output.write(line)
    else:
        duplicate = False

output.close()

For file1.txt:

Man
Dog
Axe
Cat
Potato
Farmer

file2.txt:

Man
Dog
Axe
Cat

output.txt should be:

Potato
Farmer

But instead it is:

Cat
Potato
Farmer

Any help would be greatly appreciated!

Answer 1

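Based on the behavior, file2.txt does not end with a newline, so the contents of lines2 are ['Man\n', 'Dog\n', 'Axe\n', 'Cat']. Note that 'Cat' has no newline, so it never compares equal to the 'Cat\n' read from file1.txt.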
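
A minimal sketch that reproduces the mismatch, using io.StringIO objects as stand-ins for the real files. The sample strings are assumptions that mirror the question's data, with the second one deliberately missing its trailing newline:

```python
import io

# Stand-ins for the two files; file2_text has no trailing newline.
file1_text = "Man\nDog\nAxe\nCat\nPotato\nFarmer\n"
file2_text = "Man\nDog\nAxe\nCat"

lines1 = io.StringIO(file1_text).readlines()
lines2 = io.StringIO(file2_text).readlines()

# readlines() keeps line endings, so the last element of lines2 is 'Cat'.
print(lines2)

# Comparing the raw lines treats 'Cat\n' and 'Cat' as different strings,
# so 'Cat\n' is reported as new even though file2 contains Cat.
raw_diff = [line for line in lines1 if line not in lines2]
print(raw_diff)
```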

I'd suggest normalizing your lines so that they have no newlines, replacing:

lines1 = newLines.readlines()
lines2 = originalLines.readlines()

with:

lines1 = [line.rstrip('\n') for line in newLines]
# Set comprehension makes lookup cheaper and dedupes
lines2 = {line.rstrip('\n') for line in originalLines}

and changing:

output.write(line)

to:

print(line, file=output)

which will add the newline for you. Really, the best solution is to avoid the inner loop entirely, changing all of this:

for line2 in lines2:
    if line == line2:
        duplicate = True
        break

if duplicate == False:
    output.write(line)
else:
    duplicate = False

to just:

if line not in lines2:
    print(line, file=output)

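If you use a set for lines2 as I suggested, the cost of each test drops from linear in the number of lines in file2.txt to roughly constant, regardless of the size of file2.txt (as long as the set of unique lines fits in memory).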
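
A rough way to see the difference is to time the same membership test against a list and a set. This is only a sketch: the 100,000 generated lines and the probe string are made-up data, and the exact timings will vary by machine.

```python
import timeit

# Hypothetical data: 100,000 distinct lines and a probe that is absent,
# which is the worst case for a list (every element gets compared).
lines = [f"line {i}" for i in range(100_000)]
as_list = lines
as_set = set(lines)
probe = "not present"

# Both containers give the same answer...
same_answer = (probe in as_list) == (probe in as_set)

# ...but the list scans its elements while the set does a hash lookup.
list_time = timeit.timeit(lambda: probe in as_list, number=100)
set_time = timeit.timeit(lambda: probe in as_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```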

Better still, use with statements for the open files, and stream file1.txt instead of holding it in memory. The end result is:

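# Load file2.txt into a set of newline-stripped lines for fast lookup
with open("file2.txt") as origlines:
    lines2 = {line.rstrip('\n') for line in origlines}

# Stream file1.txt and write only the lines that file2.txt lacks
with open("file1.txt") as newlines, open("output.txt", "w") as output:
    for line in newlines:
        line = line.rstrip('\n')
        # line.strip() is empty for blank and whitespace-only lines
        if line.strip() and line not in lines2:
            print(line, file=output)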
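
Since the question mentions doing this often, the same logic could be wrapped in a function. The sketch below is one possible shape, not part of the original answer: the name write_new_lines, its parameters, and the returned count are all hypothetical, and it skips blank lines with line.strip(). The usage part recreates the question's sample files in a temporary directory.

```python
import tempfile
from pathlib import Path


def write_new_lines(new_path, original_path, output_path):
    """Write the lines of new_path that are absent from original_path.

    Hypothetical helper; returns the number of lines written.
    """
    with open(original_path) as original:
        seen = {line.rstrip("\n") for line in original}

    written = 0
    with open(new_path) as new, open(output_path, "w") as output:
        for line in new:
            line = line.rstrip("\n")
            # Skip blank or whitespace-only lines and lines already seen.
            if line.strip() and line not in seen:
                print(line, file=output)
                written += 1
    return written


# Usage with the question's sample data.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "file1.txt").write_text("Man\nDog\nAxe\nCat\nPotato\nFarmer\n")
    (tmp / "file2.txt").write_text("Man\nDog\nAxe\nCat")  # no trailing newline
    count = write_new_lines(tmp / "file1.txt", tmp / "file2.txt", tmp / "output.txt")
    result = (tmp / "output.txt").read_text()

print(count)
print(result)
```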

Answer 2

You can use numpy for a shorter, faster solution. Here we use these numpy methods:

np.loadtxt docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
np.setdiff1d docs: https://docs.scipy.org/doc/numpy-1.14.5/reference/generated/numpy.setdiff1d.html
np.savetxt docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

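import numpy as np

# Read each file as an array of strings, then keep the values of file1.txt
# that do not appear in file2.txt. np.setdiff1d returns them sorted and unique,
# so the output order may differ from the order in file1.txt.
arr = np.setdiff1d(np.loadtxt('file1.txt', dtype=str), np.loadtxt('file2.txt', dtype=str))
np.savetxt('output.txt', arr, fmt='%s')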

How do I write out the lines of the first text file that are not present in the second text file?
