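I'm a real beginner with Python, but I'm trying to compare some data extracted from two databases into files. In the script I use one dictionary per database, and whenever I find a difference I add it to the corresponding dictionary. The keys are the combination of the first two values (code and subCode), and each value is the list of longCodes associated with that code/subCode combination. Overall my script works, but I wouldn't be surprised if it is horribly constructed and inefficient. A sample of the data being processed: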
0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
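The idea is that the data should be in sync, but sometimes it isn't, and I'm trying to detect the differences. In reality, when I extract the data from the DBs, each file has over 1 million lines. Performance doesn't seem great (or maybe this is as good as it gets?): it takes about 35 minutes to process the files and give me the results. If anyone has suggestions for improving performance, I'll gladly take them!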
import difflib, sys, csv, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)
f1.close()
f2.close()
Answer 0 (score: 1)
Try this:
import difflib, sys, csv, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            # Split once and reuse the pieces for both the key and the value
            sp = line.split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                # First longCode seen for this code/subCode pair
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            line = line[2:]
            sp = line.split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]
# The with block closes both files, so no explicit close() calls are needed
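As a side note on the grouping step that both branches above repeat: `collections.defaultdict(list)` removes the need for either a membership test or a try/except. The following is only a minimal sketch in Python 3, assuming lines shaped like the sample data in the question; the helper name `group_by_code` is made up for illustration and does not appear in the original scripts.

```python
from collections import defaultdict


def group_by_code(lines):
    """Group longCodes under their "code,subCode" key (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        # Split at most twice so any commas inside the long code survive.
        code, sub_code, long_code = line.rstrip().split(",", 2)
        # A missing key is created as an empty list on first access.
        grouped[code + "," + sub_code].append(long_code)
    return grouped


sample = ["2,2,213\n", "2,2,214\n", "2,0,211\n"]
print(dict(group_by_code(sample)))
# {'2,2': ['213', '214'], '2,0': ['211']}
```

On Python 3.7 and later a plain dict (and therefore a defaultdict) keeps insertion order, so an `OrderedDict` is not required just to preserve the order in which keys were first seen.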
Answer 1 (score: 0)
So I ended up using different logic to come up with a much more efficient script. Many thanks to https://stackoverflow.com/users/100297/martijn-pieters for the help.
#!/usr/bin/python
import csv, sys, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')
# First find entries in SLAVE that don't match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}
with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)
    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)
# Now find entries in MASTER that don't match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}
with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)
    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
This solution processes the data in about 10 seconds (and it actually goes through the data twice).
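For anyone on Python 3, the same set-membership idea can be written so that each file is read only once. This is a rough sketch rather than the script above: the function names `read_rows` and `only_in` are invented for illustration, and it assumes every row has exactly three columns (code, subCode, longCode) as in the sample data.

```python
import csv
from collections import defaultdict


def read_rows(path):
    """Read a CSV file into a list of tuples, keeping file order."""
    # newline="" is what the csv module expects for files opened in text mode.
    with open(path, newline="") as f:
        return [tuple(row) for row in csv.reader(f) if row]


def only_in(rows, other_rows):
    """Group the rows that are missing from other_rows by "code,subCode"."""
    other = set(other_rows)  # set lookup is O(1) on average per row
    missing = defaultdict(list)
    for code, sub_code, long_code in rows:  # assumes exactly three columns
        if (code, sub_code, long_code) not in other:
            missing[code + "," + sub_code].append(long_code)
    return missing


# In-memory stand-ins for the two files, so the sketch runs on its own.
master = [("0", "0", "83"), ("2", "2", "213"), ("2", "2", "214")]
slave = [("0", "0", "83"), ("2", "2", "213"), ("2", "2", "215")]
print(dict(only_in(master, slave)))  # {'2,2': ['214']}
print(dict(only_in(slave, master)))  # {'2,2': ['215']}
```

With the real files, the two lists would come from `read_rows('masterDbCodes.lst')` and `read_rows('slaveDbCodes.lst')`. Holding both files in memory as lists of tuples trades some RAM for not reopening them, which should be acceptable at roughly a million short rows each, though that depends on the machine.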