Comparing two files for differences in Python

Time: 2016-10-13 20:05:25

Tags: python linux diff

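I want to compare two files (take each line from the first file and look for it in the whole second file) to see the differences between them, and write the lines from fileA.txt that are missing in fileB.txt to the end of fileB.txt. I am new to Python, so my first idea was a simple program like this: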

import difflib

file1 = "fileA.txt"
file2 = "fileB.txt"

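# Read both files (closing them afterwards), then print the tagged diff.
with open(file1) as fa, open(file2) as fb:
    diff = difflib.ndiff(fa.readlines(), fb.readlines())
print(''.join(diff), end='')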

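But as a result I get a combination of the two files, with a tag in front of each line. I know I could look for lines starting with the tag "- " and write them to the end of fileB.txt, but with large files (~100 MB) this approach is very inefficient. Can somebody help me improve the program?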
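For reference, the tag-filtering idea mentioned above can be sketched as follows. This is only an illustration, assuming both files fit in memory; the helper name `lines_missing_from_b` is hypothetical. Note that `difflib.ndiff` compares sequences by position, so a line that merely sits at a different place in the second file can also be tagged "- ".

```python
import difflib


def lines_missing_from_b(lines_a, lines_b):
    """Return the lines that ndiff tags with '- ' (only in the first sequence)."""
    # ndiff prefixes every output line with a two-character tag:
    # '- ' (only in a), '+ ' (only in b), '  ' (common) or '? ' (hint line).
    return [d[2:] for d in difflib.ndiff(lines_a, lines_b) if d.startswith('- ')]
```

The returned lines keep their trailing newlines, so they could be written to a file opened in append mode without further changes.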

The file structure is as follows:

Input:

fileA.txt

Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

fileB.txt

Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)

Output:

fileB_after.txt

Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

2 answers:

Answer 0: (score: 1)

Try this in bash:

cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt

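sort -M: sorts by an initial string consisting of any amount of whitespace followed by a month name abbreviation, which is folded to upper case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The `LC_TIME' locale determines the month spellings.

uniq: filters out repeated adjacent lines in its input.

|: passes the output of one command to another command for further processing.

This takes both files, sorts them as described above, keeps only the unique lines, and stores them in new_file.txt.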

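Note: this is not a Python solution, but you tagged the question with linux, so I thought you might be interested. You can also find more details about the commands used here.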
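A rough Python counterpart of the pipeline above, for readers who prefer to stay in one language. This is a sketch under assumptions, not a reimplementation of GNU sort: it only handles English month abbreviations, and it breaks ties by plain string comparison of the whole line. The names `MONTHS`, `month_key` and `merge_unique` are hypothetical.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def month_key(line):
    """Approximate the `sort -M` key for a single line."""
    # Skip leading whitespace and fold the first three letters to upper case.
    name = line.lstrip()[:3].upper()
    # Unknown month names rank below every valid month.
    rank = MONTHS.index(name) + 1 if name in MONTHS else 0
    # Fall back to the whole line so equal months still get a stable order.
    return (rank, line)


def merge_unique(lines_a, lines_b):
    """Concatenate, drop duplicates, and sort by month (cat | sort -M | uniq)."""
    return sorted(set(lines_a) | set(lines_b), key=month_key)
```

For example, `merge_unique(["Oct  9 a", "Jan  1 b"], ["Oct  9 a", "xyz"])` places the line without a valid month first, then the January line, then the single remaining October line.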

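Answer 1: (score: 1)

Read both files and convert them to sets.

Find the union of the two sets.
Sort the union by time.
Convert the sorted result to a string joined with newlines.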

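import datetime

file1 = "fileA.txt"
file2 = "fileB.txt"

def timestamp(line):
    # Parse the leading "Oct  9 13:25:31" part; the log has no year, so strptime falls back to 1900.
    return datetime.datetime.strptime(' '.join(line.split()[:3]), '%b %d %H:%M:%S')

# Read each file into a set, dropping trailing newlines and blank lines.
with open(file1) as f:
    sa = set(line.rstrip('\n') for line in f if line.strip())
with open(file2) as f:
    sb = set(line.rstrip('\n') for line in f if line.strip())

# Sort the union by timestamp and print one entry per line.
print('\n'.join(sorted(sa | sb, key=timestamp)))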



Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
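The question asks for the missing lines to be appended to the end of fileB.txt rather than for a re-sorted merge. A minimal sketch of that variant is shown below, assuming that exact string equality is the right notion of "the same line" and that the lines of the destination file fit in memory as a set (the source file is streamed). The function name `append_missing_lines` is hypothetical.

```python
def append_missing_lines(src_path, dst_path):
    """Append to dst_path every line of src_path that dst_path lacks.

    Returns the number of lines appended.
    """
    seen = set()
    ends_with_newline = True
    # Collect the destination's lines; remember whether its last line is terminated.
    with open(dst_path) as dst:
        for line in dst:
            ends_with_newline = line.endswith('\n')
            seen.add(line.rstrip('\n'))

    appended = 0
    with open(src_path) as src, open(dst_path, 'a') as out:
        for line in src:
            text = line.rstrip('\n')
            if not text or text in seen:
                continue  # skip blank lines and lines already present
            if not ends_with_newline:
                out.write('\n')  # terminate the old last line first
                ends_with_newline = True
            out.write(text + '\n')
            seen.add(text)  # also skips duplicates within the source
            appended += 1
    return appended
```

Calling `append_missing_lines("fileA.txt", "fileB.txt")` keeps the existing order of fileB.txt and adds the new lines in the order they occur in fileA.txt. Set membership tests take constant time on average, so each line of the source is checked once instead of being searched for in the whole second file.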