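Basically, I want to create a Python script for my daily tasks that compares two files of any size and generates two new files: one with the matching records and one with the non-matching records from the two files.
I wrote the Python script below and found that it works fine on files with only a few records.
But when I run the same script on files with 200,000 and 500,000 records, the generated result files do not contain valid output.
So, could you check the script below and help identify the problem that causes the incorrect output?
Thanks in advance.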
    from sys import argv

    script, filePathName1, filePathName2 = argv

    def FileDifference(filePathName1, filePathName2):
        fileObject1 = open(filePathName1,'r')
        fileObject2 = open(filePathName2,'r')
        newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
        newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
        newFileObject1 = open(newFilePathName1,'a')
        newFileObject2 = open(newFilePathName2,'a')
        file1 = fileObject1.readlines()
        file2 = fileObject2.readlines()
        Differece = [ diff for diff in file1 if diff not in file2 ]
        for i in range(0,len(Differece)):
            newFileObject1.write(Differece[i])
        Matching = [ match for match in file1 if match in file2 ]
        for j in range(0,len(Matching)):
            newFileObject2.write(Matching[j])
        fileObject1.close()
        fileObject2.close()
        newFileObject1.close()
        newFileObject2.close()

    FileDifference(filePathName1, filePathName2)
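Edit-1: Please note that the program above runs without any errors. It is only that the output is incorrect, and the program takes much longer to get through large files.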
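Answer 0 (score: 1)
I'll take a guess and assume that "no valid output" means "runs forever without producing anything useful".
That would be logical, because of your list comprehensions: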
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])
    Matching = [ match for match in file1 if match in file2 ]
    for i in range(0,len(Matching)):
        newFileObject2.write(Matching[i])
Each "in" / "not in" test against a list is an O(n) lookup. That is fine for a few lines, but it never finishes if, say, len(file1) == 100000 and file2 is the same size: you then perform 100000 * 100000 iterations => 10 ** 10 => forever.
The fix is simple: create sets and use intersection & difference, which is much faster:

    file1 = set(fileObject1.readlines())
    file2 = set(fileObject2.readlines())
    difference = file1 - file2
    for i in difference:
        newFileObject1.write(i)
    matching = file1 & file2
    for i in matching:
        newFileObject2.write(i)
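One thing to keep in mind is that pure set operations drop duplicate lines and do not preserve the original line order. If either matters, a possible variation is to build a set only for the lookups and still iterate over file1 as a list. The sketch below assumes that approach; the names split_records and file_difference are made up for illustration, and the output files are opened with 'w' instead of 'a' on the assumption that results from earlier runs should not accumulate.

```python
from sys import argv


def split_records(lines1, lines2):
    """Return (non_matching, matching) lines of lines1 relative to lines2.

    Order and duplicates of lines1 are preserved; only the lookup uses a set.
    """
    lookup = set(lines2)  # membership tests on a set are O(1) on average
    non_matching = [line for line in lines1 if line not in lookup]
    matching = [line for line in lines1 if line in lookup]
    return non_matching, matching


def file_difference(path1, path2):
    # 'with' closes the files even if an error occurs while reading
    with open(path1, 'r') as f1, open(path2, 'r') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    non_matching, matching = split_records(lines1, lines2)

    # 'w' overwrites results left over from a previous run (assumption:
    # the original 'a' mode was not intentional)
    with open(path1 + ' - NonMatchingRecords.txt', 'w') as out:
        out.writelines(non_matching)
    with open(path1 + ' - MatchingRecords.txt', 'w') as out:
        out.writelines(matching)


if __name__ == '__main__' and len(argv) == 3:
    file_difference(argv[1], argv[2])
```

Note that readlines() keeps the trailing newline, so the last line of a file that does not end with a newline will not compare equal to the same record elsewhere. Stripping line endings before comparing may be worth considering if that applies to your data.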