Get unique lines in two text files

Time: 2018-07-05 13:07:13

Tags: python text duplicates

I have two unsorted text files (between 150MB and 1GB in size).

I want to find all the lines that appear in a.txt but not in b.txt.

a.txt contains ->

qwe
asd
zxc
rty

b.txt contains ->

qwe
zxc

If I combine a.txt and b.txt in c.txt, I get:

qwe
asd
zxc
rty
qwe
zxc

I sort them alphabetically and get:

asd
qwe
qwe
rty
zxc
zxc

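Then I use a regex pattern to search for (.*)\n(\1)\n and replace every match with nothing, and then replace all \n\n with \n several times to get the "difference" between the two files.

Now I am unable to do the same in Python. I can do it up to the sorting part, but the regex does not seem to work across multiple lines. Here is my Python code: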

f = open("output.txt", 'w')
s = open(outputfile,'r+')
for line in s.readlines():
    s = line.replace('(.*)\n(\1)\n', '')
    f.write(s)

f.close() 

1 answer:

Answer 0 (score: 0)

  

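I am able to do this up to the sorting part, but regex seems not to work across multiple lines.

Your regex is fine. You don't have multi-line text. You have single lines:

for line in s.readlines():

file.readlines() reads the whole file into memory as a list of lines. You then iterate over each of those individual lines, so line will be 'asd\n' or 'qwe\n', and never 'qwe\nqwe\n'.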
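To illustrate the point, here is a minimal sketch (not part of the original answer) that applies the pattern to the whole sorted text at once instead of line by line. It assumes the merged text fits in memory. It uses re.sub() because str.replace() only matches literal text, and it adds a ^ anchor with re.MULTILINE so that a match can only start at the beginning of a line. The name remove_adjacent_pairs is made up for this example.

```python
import re


# Hypothetical helper, named for this sketch only.
def remove_adjacent_pairs(sorted_text):
    """Remove each pair of identical adjacent lines from newline-terminated text."""
    # ^ with re.MULTILINE anchors every match at the start of a line, so the
    # backreference \1 compares whole lines rather than line suffixes.
    return re.sub(r'^(.*)\n\1\n', '', sorted_text, flags=re.MULTILINE)


merged = 'asd\nqwe\nqwe\nrty\nzxc\nzxc\n'
print(remove_adjacent_pairs(merged), end='')
```

Because matches are removed in pairs, a line that occurs three times in a row would leave one copy behind.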

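Given that you are reading the whole merged file into memory, I'm going to presume that your files are not that big. In that case, it would be much easier to read one of the files into a set object, then test each line of the other file to find the differences:

with open('a.txt', 'r') as file_a:
    lines = set(file_a)  # all lines, as a set, with newlines

new_in_b = []
with open('b.txt', 'r') as file_b:
    for line in file_b:
        if line in lines:
            # present in both files, remove from `lines` to find extra lines in a
            lines.remove(line)
        else:
            # extra line in b
            new_in_b.append(line)

print('Lines in a missing from b')
for line in sorted(lines):
    print(line.rstrip())  # remove the newline when printing.
print()

print('Lines in b missing from a')
for line in new_in_b:
    print(line.rstrip())  # remove the newline when printing.
print()

If you want to write those all out to a file, you can combine the two sequences and write out the sorted list:

with open('c.txt', 'w') as file_c:
    file_c.writelines(sorted(list(lines) + new_in_b))

Your approach, sorting the lines first, putting them all in one file, and then matching paired lines, is possible too. All you need to do is remember the preceding line. Together with the current line, that is a pair. Note that you don't need a regular expression for this, just an equality test:

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    preceding = None
    for line in file_c:
        if preceding is not None and preceding == line:
            # skip writing this pair, and clear 'preceding' so we don't
            # check the next line against it
            preceding = None
        else:
            # the remembered line had no partner, so write it out
            if preceding is not None:
                outfile.write(preceding)
            preceding = line
    # write out the last line
    if preceding is not None:
        outfile.write(preceding)

Note that this never reads the whole file into memory! Iterating directly over the file gives you individual lines, with the file being read in chunks into a buffer. This is a very efficient way of processing lines.

You can also iterate over the file two lines at a time, using the itertools library to tee off the file object iterator. A line is then written only if it differs from both of its neighbours:

from itertools import tee

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    iter1, iter2 = tee(file_c)  # two iterators with a shared source
    line2 = next(iter2, None)  # move the second iterator ahead a line
    prev_equal = False  # True when the current line equals the one before it
    # iterate over each line together with the line that follows it
    for line1, line2 in zip(iter1, iter2):
        next_equal = line1 == line2
        if not prev_equal and not next_equal:
            outfile.write(line1)
        prev_equal = next_equal
    # the last line is never line1; write it out if it didn't match
    # the preceding line
    if line2 is not None and not prev_equal:
        outfile.write(line2)

A third approach is to use itertools.groupby() to group equal lines together. You can then decide what to do with those groups:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        # group is an iterator of all the lines in c that are equal
        # the same value is already in line, so all we need to do is
        # *count* how many such lines there are:
        count = sum(1 for _ in group)  # get an efficient count
        if count == 1:
            # line is unique, write it out
            outfile.write(line)

I'm assuming that it doesn't matter if there are 2 or more copies of the same line. In other words, you don't want pairing, you only want to find the unique lines (those present only in a or only in b).

If your files are extremely large but already sorted, you can use a merge sort approach, without having to merge the two files into one manually. The heapq.merge() function gives you lines from multiple files in sorted order, provided the inputs are sorted individually. Use this together with groupby():

import heapq
from itertools import groupby

# files a.txt and b.txt are assumed to be sorted already
with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b,\
        open('output.txt', 'w') as outfile:
    for line, group in groupby(heapq.merge(file_a, file_b)):
        count = sum(1 for _ in group)
        if count == 1:
            outfile.write(line)

Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at a time, as does groupby(). This lets you process files of any size, regardless of your memory constraints.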
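One possible refinement, sketched here under assumptions that go beyond the answer: counting the members of each group treats a line that repeats within a single file as non-unique. Tagging each line with its source before merging makes it possible to report lines found in only one input, however often they repeat there. The function name lines_in_one_source_only and the 'a' / 'b' tags are invented for this sketch, and both inputs are assumed to be sorted already in Python's string order.

```python
import heapq
from itertools import groupby
from operator import itemgetter


# Hypothetical helper, named for this sketch only.
def lines_in_one_source_only(sorted_a, sorted_b):
    """Yield (line, tag) for every line present in exactly one sorted input."""
    tagged_a = ((line, 'a') for line in sorted_a)
    tagged_b = ((line, 'b') for line in sorted_b)
    # Tuples compare by line first, so the merged stream stays sorted by line.
    merged = heapq.merge(tagged_a, tagged_b)
    for line, group in groupby(merged, key=itemgetter(0)):
        # Collect which inputs contributed this line.
        sources = {tag for _, tag in group}
        if len(sources) == 1:
            yield line, sources.pop()


a_lines = ['asd\n', 'asd\n', 'qwe\n', 'rty\n', 'zxc\n']
b_lines = ['qwe\n', 'zxc\n']
for line, tag in lines_in_one_source_only(a_lines, b_lines):
    print(tag, line.rstrip())
```

The lists stand in for open file objects; since a file object iterates over its lines, two sorted files could be passed in the same way.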