Finding the differences between two files is very slow

Asked: 2017-05-22 17:17:05

Tags: python performance comparison

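I'm a real beginner with Python, but I'm trying to compare data extracted from two databases into files. In the script I keep one dictionary per database, and whenever I find a difference I add it to the corresponding dictionary. The keys are the combination of the first two values (code and subCode), and each value is the list of longCodes associated with that code/subCode combination. Overall my script works, but I wouldn't be surprised if it is badly structured and inefficient. A sample of the data being processed: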

0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
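
To make the dictionary layout described above concrete, here is a minimal sketch that groups the sample rows by their code/subCode pair. It is written for Python 3; the names `SAMPLE` and `group_by_code` are hypothetical and not part of the original script.

```python
import csv
import io
from collections import OrderedDict

# The sample rows from the question, held in memory for illustration.
SAMPLE = """\
0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
"""


def group_by_code(lines):
    """Map 'code,subCode' to the list of longCodes seen for that pair."""
    grouped = OrderedDict()
    for code, sub_code, long_code in csv.reader(lines):
        grouped.setdefault(code + "," + sub_code, []).append(long_code)
    return grouped


grouped = group_by_code(io.StringIO(SAMPLE))
print(grouped["0,0"])  # ['83']
print(grouped["2,2"])  # ['213', '214', '215']
```

Using `setdefault` avoids the separate "is the key already present" branch while producing the same key/list structure.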

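The idea is that the data should be in sync, but sometimes it isn't, and I'm trying to detect the differences. In practice, when I extract the data from the DBs, each file has more than 1 million lines. Performance doesn't seem great (maybe it could be much better?): it takes about 35 minutes to process and give me the results. If there are any suggestions for improving performance, I'd gladly take them!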

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

f1.close()
f2.close()
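
For context, the script above relies on the two-character prefixes that `difflib.ndiff` puts in front of every line it yields: `- ` for lines only in the first sequence, `+ ` for lines only in the second, two spaces for common lines, and `? ` for hint lines that appear in neither input. The following is a small sketch on made-up rows (Python 3; `split_diff` and the sample lists are hypothetical) showing how those prefixes separate the two sides.

```python
import difflib

# Tiny made-up inputs: only the middle row differs.
master_lines = ["0,0,83\n", "0,1,157\n", "1,1,158\n"]
slave_lines = ["0,0,83\n", "0,1,999\n", "1,1,158\n"]


def split_diff(a, b):
    """Return (lines only in a, lines only in b) based on ndiff prefixes."""
    only_a, only_b = [], []
    for line in difflib.ndiff(a, b):
        if line.startswith("- "):
            only_a.append(line[2:].rstrip())
        elif line.startswith("+ "):
            only_b.append(line[2:].rstrip())
        # Lines starting with "  " (common) or "? " (hints) are ignored.
    return only_a, only_b


only_master, only_slave = split_diff(master_lines, slave_lines)
print(only_master)  # ['0,1,157']
print(only_slave)   # ['0,1,999']
```

Note that `ndiff` computes an ordered, line-by-line comparison, which is more than this task needs if only the set of differing rows matters.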

2 Answers:

Answer 0 (score: 1)

Try this:

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            # Split once and reuse the parts for both the key and the value
            parts = line.split(",", 2)
            codeSubCode = ",".join(parts[:2])
            longCode = ",".join(parts[2:]).rstrip()
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                # First longCode seen for this code/subCode pair
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            line = line[2:]
            parts = line.split(",", 2)
            codeSubCode = ",".join(parts[:2])
            longCode = ",".join(parts[2:]).rstrip()
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]

# The with statement has already closed both files; no explicit close() is needed.

Answer 1 (score: 0)

So I ended up using different logic to come up with a much more efficient script. Many thanks to https://stackoverflow.com/users/100297/martijn-pieters for the help.

#!/usr/bin/python

import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

# First, find entries in SLAVE that don't match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)

    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

# Now find entries in MASTER that don't match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)

    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)

This solution processes the data (twice, in fact) in about 10 seconds.
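
The script above opens the files with `'rb'`, which is the Python 2 convention for the `csv` module. As a rough sketch of the same set-membership idea for Python 3, the version below reads each file only once and reuses the rows for both directions. The function names (`read_rows`, `group_missing`, `compare_files`) are hypothetical, and the sketch assumes both files fit comfortably in memory and that every row has at least three fields.

```python
import csv
from collections import OrderedDict


def read_rows(path):
    """Read a CSV file into a list of tuples, skipping empty lines."""
    # In Python 3 the csv module expects text mode with newline="".
    with open(path, newline="") as handle:
        return [tuple(row) for row in csv.reader(handle) if row]


def group_missing(rows, other_rows):
    """Group rows that are absent from other_rows by 'code,subCode'."""
    other = set(other_rows)  # average O(1) membership tests
    grouped = OrderedDict()
    for row in rows:
        if row not in other:
            key = ",".join((row[0], row[1]))
            grouped.setdefault(key, []).append(row[2])
    return grouped


def compare_files(master_path, slave_path):
    """Return (entries only in master, entries only in slave)."""
    master_rows = read_rows(master_path)
    slave_rows = read_rows(slave_path)
    master_only = group_missing(master_rows, slave_rows)
    slave_only = group_missing(slave_rows, master_rows)
    return master_only, slave_only
```

With the file names from the question, it could be called as `masterDb, slaveDb = compare_files('masterDbCodes.lst', 'slaveDbCodes.lst')`. Unlike a line-oriented diff, this ignores row order and only reports rows that are missing from the other file.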