Question

I have two unsorted text files (between 150MB and 1GB in size).

I want to find all the lines that are present in a.txt and not in b.txt.

a.txt contains ->

qwe
asd
zxc
rty

b.txt contains ->

qwe
zxc

If I combine a.txt and b.txt into c.txt, I get:

qwe
asd
zxc
rty
qwe
zxc

I sort them alphabetically and get:

asd
qwe
qwe
rty
zxc
zxc

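Then I use a regex pattern to search for (.*)\n(\1)\n and replace every match with nothing, and then repeatedly replace all \n\n with \n to get the "difference" between the two files.

Now I am unable to do the same in Python. I can get as far as the sorting part, but regular expressions don't seem to work across multiple lines. Here is my Python code: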

f = open("output.txt", 'w')
s = open(outputfile,'r+')
for line in s.readlines():
    s = line.replace('(.*)\n(\1)\n', '')
    f.write(s)

f.close()

Answer 1

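"I can get as far as the sorting part, but regular expressions don't seem to work across multiple lines."

Your regex is fine. You don't have multiple lines. You have single lines: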

for line in s.readlines():
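As a rough illustration of the difference, the sketch below applies the same pattern once per line and once to the whole text. The names `text`, `PAIR`, `per_line` and `whole` are made up for this example, and the leading `^` together with `re.MULTILINE` is an added assumption so that a pair can only start at the beginning of a line. Note also that `str.replace()` only substitutes literal substrings, so the sketch uses `re.sub()` semantics via a compiled pattern.

```python
import re

# Sorted, merged sample data from the question, held in memory as one string.
text = "asd\nqwe\nqwe\nrty\nzxc\nzxc\n"

# Two identical consecutive lines; ^ plus re.MULTILINE anchors at line starts.
PAIR = re.compile(r"^(.*)\n\1\n", re.MULTILINE)

# Per line: each string holds exactly one newline, so the pattern cannot match.
per_line = "".join(PAIR.sub("", line) for line in text.splitlines(keepends=True))

# Whole text: the pattern can span two lines, so the pairs are removed.
whole = PAIR.sub("", text)

print(repr(per_line))
print(repr(whole))
```

Running the pattern over the whole text means holding the entire file in memory as a single string.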

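file.readlines() reads the whole file into memory as a list of lines. You then iterate over each of those single lines, so line will be 'asd\n' or 'qwe\n', and never 'qwe\nqwe\n'.

Given that you are reading all of your merged file into memory, I'm going to presume that your files are not that big. In that case, it is much easier to read one of the files into a set object, then test each line of the other file to find the differences:

with open('a.txt', 'r') as file_a:
    lines = set(file_a)  # all lines, as a set, with newlines

shared = set()  # lines present in both files
new_in_b = []
with open('b.txt', 'r') as file_b:
    for line in file_b:
        if line in lines:
            # present in both files; remember it so repeats in b are not
            # mistaken for extra lines
            shared.add(line)
        else:
            # extra line in b
            new_in_b.append(line)

lines -= shared  # what remains are the extra lines in a

print('Lines in a missing from b')
for line in sorted(lines):
    print(line.rstrip())  # remove the newline when printing.
print()

print('Lines in b missing from a')
for line in new_in_b:
    print(line.rstrip())  # remove the newline when printing.
print()

If you want to write those out to a file, you can combine the two sequences and write out the sorted list:

with open('c.txt', 'w') as file_c:
    file_c.writelines(sorted(list(lines) + new_in_b))

Your approach, sorting the lines first, putting them all in one file, and then matching up paired lines, is possible too. All you need to do is remember the preceding line. Together with the current line, that is a pair. Note that you don't need a regular expression for this, just an equality test:

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    preceding = None
    for line in file_c:
        if preceding is not None and preceding == line:
            # skip writing this pair, and clear 'preceding' so we don't
            # check the next line against it
            preceding = None
        else:
            if preceding is not None:
                outfile.write(preceding)
            preceding = line
    # write out the last line
    if preceding is not None:
        outfile.write(preceding)

Note that this never reads the whole file into memory! Iterating directly over the file gives you individual lines, while the file itself is read in chunks into a buffer. This is a very efficient way of processing lines.

You can also look at the file two lines at a time, using the itertools library to tee off the file object iterator:

from itertools import tee, zip_longest

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    iter1, iter2 = tee(file_c)  # two iterators with shared source
    next(iter2, None)  # move second iterator ahead a line
    skip = False  # True when the current line is the second half of a pair
    # iterate over this and the next line; the last line is paired with None
    for line1, line2 in zip_longest(iter1, iter2):
        if skip:
            # second line of a matched pair, drop it
            skip = False
        elif line1 == line2:
            # start of a matched pair, drop this line and the next
            skip = True
        else:
            outfile.write(line1)

A third approach is to use itertools.groupby() to group equal lines together. You can then decide what to do with those groups:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        # group is an iterator of all the lines in c that are equal
        # the same value is already in line, so all we need to do is
        # *count* how many such lines there are:
        count = sum(1 for _ in group)  # get an efficient count
        if count == 1:
            # line is unique, write it out
            outfile.write(line)

I'm assuming that it doesn't matter if there are 2 or more copies of the same line. In other words, you don't want pairing, you only want to find the unique lines (those present only in a or only in b).

If your files are extremely large but already sorted, you can use a merge sort approach, without having to merge your two files into one manually. The heapq.merge() function gives you lines from multiple files in sorted order, provided the inputs are sorted individually. Use this together with groupby():

import heapq
from itertools import groupby

# files a.txt and b.txt are assumed to be sorted already
with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b,\
        open('output.txt', 'w') as outfile:
    for line, group in groupby(heapq.merge(file_a, file_b)):
        count = sum(1 for _ in group)
        if count == 1:
            outfile.write(line)

Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at a time, as does groupby(). This lets you process files of any size, regardless of your memory constraints.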
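If only the lines of a.txt that do not occur in b.txt are needed, as the question literally asks, a one-directional variant of the set approach may be enough. The following is a minimal sketch under the assumption that the contents of b.txt fit in memory as a set; the function name `lines_only_in_a` and its parameters are hypothetical.

```python
def lines_only_in_a(path_a, path_b, path_out):
    """Write lines of path_a that never occur in path_b; return how many were written."""
    with open(path_b, "r") as file_b:
        # Strip the newline so a final line without one still compares equal.
        seen_in_b = {line.rstrip("\n") for line in file_b}

    written = 0
    with open(path_a, "r") as file_a, open(path_out, "w") as outfile:
        for line in file_a:  # streamed line by line, original order kept
            content = line.rstrip("\n")
            if content not in seen_in_b:
                outfile.write(content + "\n")
                written += 1
    return written
```

Unlike the sorted approaches, this keeps the original line order of a.txt, and a line that is repeated in a.txt is written as often as it occurs.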
