Question

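I am reading two files and comparing them to find the lines that greylist_current has but greylist_prev does not, and writing those lines to a new file. I am trying to find a faster way to do this. If you have a better solution, please reply. Thanks.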

prev_f = open('greylist_prev.txt')
current_f = open('greylist_current.txt')
greylist_f = open('greylist.txt', 'w')

prev_line = prev_f.readlines()
greylist = []
total_line = sum(1 for line in open('greylist_current.txt'))

if total_line < 10:
    greylist_f.write("\n")
else:
    for current_line in current_f:
        if current_line not in prev_line:
            greylist.append(current_line)

for line in greylist:
    greylist_f.write("%s" % line)

prev_f.close()
current_f.close()
greylist_f.close()

The result should be the same as that of this Linux command:

awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]' greylist_prev.txt greylist_current.txt > greylist.txt

Answer 1

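You can simply sort both input files and then join them. This avoids loading all the keys into main memory (which matters if the files are large), and the time complexity is dominated by the sort, usually O(n log n). For example, for files A and B: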

file A | file B
1,foo  | 1,bar
2,fooo | 3,baar
       | 4,gee

You can run:

$ join -v 2 -t ',' <(sort -t ',' -k1,1 a.txt) <(sort -t ',' -k1,1 b.txt)
3,baar
4,gee
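
For readers who want to see what the sort-then-join step is doing, here is a rough Python sketch of the same idea: sort both inputs by their first comma-separated field, then walk them together and keep the lines of the second input whose key never shows up in the first. The function name unpaired_in_second and its parameters are made up for this illustration. Unlike the shell pipeline, this version sorts in memory, so it does not keep the memory advantage mentioned above.

```python
def unpaired_in_second(a_lines, b_lines, sep=","):
    """Return the lines of b_lines whose first field is not a first field in a_lines."""
    def key(line):
        # The join key is everything before the first separator.
        return line.split(sep, 1)[0]

    a_keys = sorted(key(line) for line in a_lines)
    b_sorted = sorted(b_lines, key=key)

    result = []
    i = 0
    for line in b_sorted:
        k = key(line)
        # Skip keys from the first input that are smaller than the current key.
        while i < len(a_keys) and a_keys[i] < k:
            i += 1
        # Keep the line only if the first input has no equal key.
        if i == len(a_keys) or a_keys[i] != k:
            result.append(line)
    return result


if __name__ == "__main__":
    a = ["1,foo", "2,fooo"]
    b = ["1,bar", "3,baar", "4,gee"]
    for line in unpaired_in_second(a, b):
        print(line)
```

With the sample files A and B above, this prints the same two lines as the join command.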

Answer 2

These lines are the culprit:

for current_line in current_f:
     if current_line not in prev_line:
           greylist.append(current_line)

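Looking up an item in a large list is slow compared with looking it up in a dictionary: the list has to be scanned element by element, while a dictionary lookup is a hash lookup that takes roughly constant time on average. That speed comes at the cost of memory. You can speed up your code by storing the lines of greylist_prev.txt in a dictionary instead of a list.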

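Open the file:

# readlines() keeps the trailing '\n' on each line, so the keys match
# the lines read later by iterating over current_f
prev_lines = open('greylist_prev.txt', 'r').readlines()

Create a dict and store the lines of the file as its keys:

d = {}
for i in prev_lines:
    d[i] = ''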

Then change the culprit lines to:

else:
   for current_line in current_f:
          if current_line not in d:
                greylist.append(current_line)

Answer 3

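You should use the difflib module. According to the documentation:

This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

difflib.Differ is what you are looking for.

This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.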
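
As a small illustration of how Differ could be applied here, the sketch below keeps only the delta lines that Differ marks with "+ ", meaning lines that appear in the second sequence only. The helper name added_lines is hypothetical. Note that a diff is order-sensitive, so a line that exists in both files but in a different position may still be reported as added; this is therefore not guaranteed to match the set-style comparison in the question, and it is not necessarily faster.

```python
import difflib


def added_lines(prev_lines, current_lines):
    """Return the lines that Differ reports as present only in current_lines."""
    delta = difflib.Differ().compare(prev_lines, current_lines)
    # Each delta line starts with a two-character code: "+ " marks lines
    # unique to the second sequence; "- ", "  " and "? " are skipped.
    return [line[2:] for line in delta if line.startswith("+ ")]


if __name__ == "__main__":
    prev = ["a\n", "b\n", "c\n"]
    current = ["a\n", "x\n", "b\n", "c\n", "y\n"]
    print(added_lines(prev, current))
```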
