Question

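My question is very basic.

I need to diff two variables, each made up of many lines, and get only the part that is new in the second one. It is easiest to understand with an example:

First variable:

Hello
my
name
is

Second variable:

name
is
Peter
and
I
am
blond

I need to extract:

Peter
and
I
am
blond

I need to do this on large files. How can I do it?

Thanks a lot.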

Answer 1

If duplicates and order don't matter, it's simple:

with open('firstFile') as f:
    first = set(f)
with open('secondFile') as f:
    second = set(f)

# Lines that appear in the second file but not in the first
diff = second - first

If the output order matters:

with open('firstFile') as f:
    first = f.readlines()
with open('secondFile') as f:
    second = f.readlines()

# Keeps the second file's order; repeated lines in second are kept too
diff = [line for line in second if line not in first]
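The list comprehension above scans `first` once for every line of `second`, which gets slow as the files grow. Here is a minimal sketch of the same order-preserving diff that uses a set for the membership test, assuming the first file's lines fit in memory. The function name `new_lines` is made up for illustration, and the sample data is the example from the question:

```python
def new_lines(first_lines, second_lines):
    """Return the lines of second_lines that are absent from first_lines, in order."""
    seen = set(first_lines)  # set membership is O(1) on average
    return [line for line in second_lines if line not in seen]


first = ["Hello", "my", "name", "is"]
second = ["name", "is", "Peter", "and", "I", "am", "blond"]
print(new_lines(first, second))  # ['Peter', 'and', 'I', 'am', 'blond']
```

As with the list version, a line that occurs several times in the second input shows up several times in the result.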

If the input order matters, then the question needs to be clarified.

If the files are large enough that loading them into memory is a bad idea, you may have to do something like this:

with open('secondFile') as secondFile, open('diffFile', 'w') as diffFile:
    for secondLine in secondFile:
        match = False
        # Rescan the first file from the start for every line of the second
        with open('firstFile') as firstFile:
            for firstLine in firstFile:
                if firstLine == secondLine:
                    match = True
                    break
        if not match:
            # secondLine still ends with its newline, so write it as-is
            diffFile.write(secondLine)

Answer 2

Based on the comments on the question, this would do it:

with open("tmp1.txt") as f:
    first = set(x.strip() for x in f)
with open("tmp2.txt") as f:
    second = set(x.strip() for x in f)
print(second - first)

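However, if we take "large" seriously, loading the whole files before processing may use more memory than is available on the machine. If the first file is small enough to fit in memory and the second is not, you can do this: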

with open("tmp1.txt") as f:
    first = set(x.strip() for x in f)
# Iterating over the file object reads one line at a time
with open("tmp2.txt") as f:
    for line in f:
        line = line.strip()
        if line not in first:
            print(line)
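The same streaming idea can be wrapped in a generator so the caller decides whether to print the new lines or write them to a file. This is only a sketch under the same assumption as above (the first file fits in memory); the name `iter_new_lines` is made up for illustration:

```python
def iter_new_lines(first_path, second_path):
    """Yield stripped lines of second_path that do not appear in first_path."""
    with open(first_path) as f:
        known = {line.strip() for line in f}  # only the first file is held in memory
    with open(second_path) as f:
        for line in f:  # the second file is read one line at a time
            line = line.strip()
            if line not in known:
                yield line
```

For example, `for line in iter_new_lines("tmp1.txt", "tmp2.txt"): print(line)` behaves like the loop above.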

If the first file is too large as well, I think you will need to resort to a database.
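One way the database route could look is with the standard library's `sqlite3` module. This is a rough sketch, not a tuned solution: the function name `diff_with_sqlite`, the table name `first_lines`, and the `db_path` parameter are all made up for illustration, and it assumes `db_path` points at a fresh database that does not already contain that table.

```python
import sqlite3


def diff_with_sqlite(first_path, second_path, db_path):
    """Yield stripped lines of second_path that are not in first_path.

    The first file's lines are stored in an indexed SQLite table, so
    neither file has to fit in memory when db_path is a file on disk.
    """
    conn = sqlite3.connect(db_path)
    try:
        # PRIMARY KEY gives an index, so each lookup avoids a full scan
        conn.execute("CREATE TABLE first_lines (line TEXT PRIMARY KEY)")
        with open(first_path) as f:
            conn.executemany(
                "INSERT OR IGNORE INTO first_lines (line) VALUES (?)",
                ((line.strip(),) for line in f),
            )
        conn.commit()
        with open(second_path) as f:
            for line in f:
                line = line.strip()
                found = conn.execute(
                    "SELECT 1 FROM first_lines WHERE line = ?", (line,)
                ).fetchone()
                if found is None:
                    yield line
    finally:
        conn.close()
```

Passing `":memory:"` as `db_path` is handy for trying it out on small inputs, but it keeps the table in RAM, so for really large files a path on disk is the point of the exercise.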

Python diff and get the new part
