我想比较两个文件(从第一个文件中获取行并在整个第二个文件中查找)以查看它们之间的差异,并将fileA.txt中的缺失行写入fileB.txt的末尾。我是python的新手,所以我第一次想到这样简单的程序:
import difflib
file1 = "fileA.txt"
file2 = "fileB.txt"
diff = difflib.ndiff(open(file1).readlines(),open(file2).readlines())
print ''.join(diff),
但结果我得到了两个文件的组合,每行都有合适的标签。我知道我可以用标签“ - ”查找行开头,然后将其写入文件fileB.txt的末尾,但是使用大文件(~100 MB)这种方法效率很低。有人可以帮助我改进计划吗?
文件结构如下:
输入:
fileA.txt
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
fileB.txt
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
输出:
fileB_after.txt
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
答案 0 :(得分:1)
在bash
:
cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt
<强> sort -M 强> 根据初始字符串进行排序,其中包含任意数量的空格 按月份名称缩写,折叠为UPPER案例并进行比较 按顺序'JAN'&lt; 'FEB'&lt; ......&lt; 'DEC'。无效的名称比较 低到有效的名字。 `LC_TIME'语言环境决定月份 拼写。
uniq:过滤掉文件中的重复行。
|:将一个命令的输出传递给另一个命令以进行进一步处理。
这将采取两个文件,按照上述方式对它们进行排序,保留唯一的项目并将它们存储在new_file.txt
注意:这不是python解决方案,但您已使用linux
标记了问题,所以我认为您可能感兴趣。您还可以找到有关所用命令的更多详细信息,here。
答案 1 :(得分:1)
读入两个文件并转换为设置
找到两套联盟
基于时间的排序联合集合
将set set to string with new line
import datetime
import
file1 = "fileA.txt"
file2 = "fileB.txt"
with open(file1 ,'rb') as f:
sa = set( line for line in f )
with open(file2 ,'rb') as f:
sb = set( line for line in f )
print '\n'.join( sorted( sa.union(sb), key = lambda x: datetime.datetime.strptime( ' '.join( x.split()[:3]), '%b %d %H:%M:%S' )) )
Oct 9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct 9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct 9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct 9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct 9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct 9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct 9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2