I have two unsorted text files (between 150MB and 1GB in size).
I want to find all the lines that appear in `a.txt` but do not appear in `b.txt`.
`a.txt` contains ->
qwe
asd
zxc
rty
`b.txt` contains ->
qwe
zxc
If I combine `a.txt` and `b.txt` into `c.txt`, I get:
qwe
asd
zxc
rty
qwe
zxc
I sort them alphabetically and get:
asd
qwe
qwe
rty
zxc
zxc
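Then I use a regex search for the pattern `(.*)\n(\1)\n` and replace every match with nothing, and then replace `\n\n` with `\n` repeatedly, to get the "difference" between the two files.
Now I can't get this to work in Python. I can get as far as the sorting part, but the regular expression doesn't seem to work across multiple lines. Here is my Python code: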
f = open("output.txt", 'w')
s = open(outputfile,'r+')
for line in s.readlines():
    s = line.replace('(.*)\n(\1)\n', '')
    f.write(s)
f.close()
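Answer 0 (score: 0)
> I can get as far as the sorting part, but the regular expression doesn't seem to work across multiple lines.
Your regex is fine. You don't have multiple lines. You have single lines: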
for line in s.readlines():
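`file.readlines()` reads the whole file into memory as a list of lines. You then iterate over each individual line, so `line` will be `'asd\n'` or `'qwe\n'`, and never `'qwe\nqwe\n'`.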
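To make the difference concrete, here is a small sketch (not the asker's code) that uses the question's sample data as an in-memory string. It assumes the pattern is applied with `re.sub` to the whole text at once, anchored at line starts so a match cannot begin in the middle of a line. Note also that `str.replace()` treats its first argument as a literal string rather than a pattern. The name `drop_pairs` is made up for illustration.

```python
import re

# Sorted, merged content from the question, held in memory as one string.
text = "asd\nqwe\nqwe\nrty\nzxc\nzxc\n"

# Line by line: every item contains exactly one newline, so a pattern
# that needs two newlines cannot match any of them.
pattern = re.compile(r"(.*)\n\1\n")
per_line_hits = [ln for ln in text.splitlines(keepends=True)
                 if pattern.search(ln)]


def drop_pairs(merged):
    """Remove each pair of identical adjacent lines (hypothetical helper)."""
    # re.MULTILINE lets ^ match at the start of every line.
    return re.sub(r"^(.*)\n\1\n", "", merged, flags=re.MULTILINE)


result = drop_pairs(text)
print(per_line_hits)
print(repr(result))
```

This only works when the whole text fits in memory, which may not hold for files near the 1GB end of the range.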
Given that you are reading the entire merged file into memory, I'm going to presume that your files are not that big. In that case, it is much easier to read one of the files into a set object, then test each line of the other file to find the differences:

with open('a.txt', 'r') as file_a:
    lines = set(file_a)  # all lines, as a set, with newlines

new_in_b = []
with open('b.txt', 'r') as file_b:
    for line in file_b:
        if line in lines:
            # present in both files, remove from `lines` to find extra lines in a
            lines.remove(line)
        else:
            # extra line in b
            new_in_b.append(line)

print('Lines in a missing from b')
for line in sorted(lines):
    print(line.rstrip())  # remove the newline when printing.
print()

print('Lines in b missing from a')
for line in new_in_b:
    print(line.rstrip())  # remove the newline when printing.
print()

If you want to write all of these out to a file, you can combine the two sequences and write out the sorted list:

with open('c.txt', 'w') as file_c:
    file_c.writelines(sorted(list(lines) + new_in_b))

Your approach works too: sort the lines first, put them all in one file, then match paired lines. All you need to do is remember the preceding line. Together with the current line, that is a pair. Note that you don't need a regular expression for this, just an equality test:

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    preceding = None
    for line in file_c:
        if preceding is not None and preceding == line:
            # skip writing this pair, and clear 'preceding' so we don't
            # check the next line against it
            preceding = None
        else:
            # nothing to write on the first line or right after a pair
            if preceding is not None:
                outfile.write(preceding)
            preceding = line
    # write out the last line
    if preceding is not None:
        outfile.write(preceding)

Note that this never reads the whole file into memory! Iterating directly over the file gives you individual lines, while the file itself is read in chunks into a buffer. This is a very efficient way of processing lines.

You can also iterate over the file two lines at a time by using the itertools library to tee off the file object iterator:

from itertools import tee

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    iter1, iter2 = tee(file_c)  # two iterators with a shared source
    line2 = next(iter2, None)  # move the second iterator ahead a line
    skip = False  # True when line1 is the second half of a matched pair
    # iterate over this line and the next line
    for line1, line2 in zip(iter1, iter2):
        if skip:
            skip = False
        elif line1 == line2:
            # drop line1 now, and line2 on the next iteration
            skip = True
        else:
            outfile.write(line1)
    # write out the last line if it didn't match the preceding one
    if line2 is not None and not skip:
        outfile.write(line2)

A third approach is to use itertools.groupby() to group equal lines together. You can then decide what to do with those groups:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        # group is an iterator of all the lines in c that are equal
        # the same value is already in line, so all we need to do is
        # *count* how many such lines there are:
        count = sum(1 for _ in group)  # get an efficient count
        if count == 1:
            # line is unique, write it out
            outfile.write(line)

I'm assuming that it doesn't matter if there are 2 or more copies of the same line. In other words, you don't want pairing, you only want to find the unique lines (those present in only a or only b).

If your files are extremely large but already sorted, you can use a merge sort approach without having to merge the two files into one manually. The heapq.merge() function gives you lines from multiple files in sorted order, provided the inputs are sorted individually. Use this together with groupby():

import heapq
from itertools import groupby

# files a.txt and b.txt are assumed to be sorted already
with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b,\
        open('output.txt', 'w') as outfile:
    for line, group in groupby(heapq.merge(file_a, file_b)):
        count = sum(1 for _ in group)
        if count == 1:
            outfile.write(line)

Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at a time, as does groupby(). This lets you process files of any size, regardless of your memory constraints.
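The question asks only for lines in a.txt that are missing from b.txt, while the heapq.merge() and groupby() combination reports lines unique to either file. One possible variation, offered as a sketch rather than as part of the original answer, tags each line with its source before merging and keeps a line only when every copy came from a. The helper name `only_in_a` and the tags 'a' and 'b' are made up for illustration, and both inputs are assumed to be sorted already.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def only_in_a(sorted_a, sorted_b):
    """Yield each distinct line found in sorted_a but not in sorted_b.

    Hypothetical helper; both inputs must already be sorted.
    """
    tagged_a = ((line, 'a') for line in sorted_a)
    tagged_b = ((line, 'b') for line in sorted_b)
    # Tuples compare by line first, so the merged stream stays sorted.
    merged = heapq.merge(tagged_a, tagged_b)
    for line, group in groupby(merged, key=itemgetter(0)):
        sources = {source for _, source in group}
        if sources == {'a'}:
            yield line


# Lists stand in for sorted file objects; the data is the question's sample.
a_lines = sorted(["qwe\n", "asd\n", "zxc\n", "rty\n"])
b_lines = sorted(["qwe\n", "zxc\n"])
missing = list(only_in_a(a_lines, b_lines))
print(missing)
```

Open file objects can be passed in place of the lists, since the helper only iterates over its inputs. A line repeated within a is reported once.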