I have two text files, each about 1 GB, with 60 columns per line. Six of the columns are the keys to be compared between the files.
Example:
File 1:
4|null|null|null|null|null|3590740374739|20077|7739662|75414741|
File 2:
4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|
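Here the two lines count as equal because columns 7, 8, 9 and 10 (the keys) are the same in both files. I tried a sample that compares the files without considering the keys, and it works fine, but I need to compare based on the keys rather than character by character across each line.
Below is the code sample I use to compare without considering the keys.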
matched = open('matchedrecords.txt','w')
with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)
with open('notInfirstSource.txt', 'w') as result:
    for line in alines:
        if line not in blines:
            result.write(line)
        else:
            matched.write(line)
with open('notInsecondSource.txt', 'w') as non:
    for lin in blines:
        if lin not in alines:
            non.write(lin)
matched.close()
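For contrast with the whole-line comparison above, the key-based equality described in the question can be sketched as follows. The helper `key_of` and the constant `KEY_COLUMNS` are hypothetical names introduced for illustration, and the sketch assumes columns 7 to 10 map to zero-based indexes 6 to 9.

```python
# Hypothetical helper: zero-based indexes 6..9 correspond to columns 7-10.
KEY_COLUMNS = (6, 7, 8, 9)

def key_of(line, columns=KEY_COLUMNS, sep='|'):
    """Return the key columns of one line as a tuple."""
    fields = line.rstrip('\n').split(sep)
    return tuple(fields[i] for i in columns)

line1 = "4|null|null|null|null|null|3590740374739|20077|7739662|75414741|"
line2 = "4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|"

same_key = key_of(line1) == key_of(line2)   # compares only the key columns
same_text = line1 == line2                  # compares the whole line
```

With the two sample lines, the keys agree even though the full lines differ, which is why a plain `set` of lines cannot pair them up.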
Answer 0 (score: 0)
Here is one way to compare lines based on keys/columns, though I'm not sure how efficient it is.
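# Zero-based indexes of the key columns (columns 7-10 in the question)
list_of_columns_to_compare = [6, 7, 8, 9]

with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)

# Note: every line of one file is compared with every line of the other,
# so the running time grows with len(blines) * len(alines).
with open('matchedrecords.txt', 'w') as matched:
    for blin in blines:
        b_fields = blin.split('|')
        # Build the key columns fresh for each line so values do not accumulate
        b_columns = [b_fields[i] for i in list_of_columns_to_compare]
        for alin in alines:
            a_fields = alin.split('|')
            a_columns = [a_fields[i] for i in list_of_columns_to_compare]
            if a_columns == b_columns:
                # Drop blin's newline so both records land on one output line
                matched.write(blin.rstrip('\n') + " = " + alin)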
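The pairwise loop visits every combination of lines, which is likely to be impractical for files of about 1 GB. A minimal sketch of one alternative, assuming the key is columns 7 to 10 (zero-based 6 to 9): index one side by key once, then look each line of the other side up in that index. `key_of` and `match_by_key` are hypothetical names, and the sample rows are made up for illustration.

```python
from collections import defaultdict

KEY_COLUMNS = (6, 7, 8, 9)  # assumed zero-based positions of columns 7-10

def key_of(line):
    """Return the key columns of one line as a tuple."""
    fields = line.rstrip('\n').split('|')
    return tuple(fields[i] for i in KEY_COLUMNS)

def match_by_key(a_lines, b_lines):
    """Yield (b_line, a_line) pairs whose key columns agree."""
    index = defaultdict(list)
    for a_line in a_lines:
        # Several lines may share one key, so keep them all
        index[key_of(a_line)].append(a_line)
    for b_line in b_lines:
        for a_line in index.get(key_of(b_line), ()):
            yield b_line, a_line

a_sample = [
    "4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|\n",
    "5|x|x|x|x|x|1|2|3|4|\n",
]
b_sample = [
    "4|null|null|null|null|null|3590740374739|20077|7739662|75414741|\n",
    "9|y|y|y|y|y|9|9|9|9|\n",
]
pairs = list(match_by_key(a_sample, b_sample))
```

Each line is split once and each lookup is a dictionary access, so the work grows roughly with the total number of lines rather than with their product.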
Answer 1 (score: 0)
Taking a hint from the recipe for KeyedSets on ActiveState, you can build a keyed set and then simply use set intersection and set difference to produce the results:
import collections.abc

class Set(collections.abc.Set):  # collections.Set in Python 2
    @staticmethod
    def key(s):
        # Columns 7-10 (zero-based slice 6:10) form the key
        return tuple(s.split('|')[6:10])
    def __init__(self, it):
        # Map each key to its line; a later line with the same key replaces an earlier one
        self._dict = {self.key(s): s for s in it}
    def __len__(self):
        return len(self._dict)
    def __iter__(self):
        return iter(self._dict.values())  # self._dict.itervalues() in Python 2
    def __contains__(self, value):
        return self.key(value) in self._dict

data = {}
for filename in 'srcone.txt', 'srctwo.txt':
    with open(filename) as f:
        data[filename] = Set(f)

with open('notInFirstSource.txt', 'w') as f:
    for line in data['srctwo.txt'] - data['srcone.txt']:
        f.write(line)

with open('notInSecondSource.txt', 'w') as f:
    for line in data['srcone.txt'] - data['srctwo.txt']:
        f.write(line)

with open('matchedrecords.txt', 'w') as f:
    for line in data['srcone.txt'] & data['srctwo.txt']:
        f.write(line)
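To see how the keyed set behaves without touching any files, here is a small self-contained sketch written for Python 3, where the abstract base class lives in `collections.abc`. The class name `KeyedLineSet` and the sample rows are hypothetical; the key is assumed to be columns 7 to 10.

```python
from collections.abc import Set

class KeyedLineSet(Set):
    """Hypothetical name: a set of lines compared by columns 7-10 only."""

    def __init__(self, lines):
        # One stored line per key; later duplicates replace earlier ones
        self._by_key = {self.key(line): line for line in lines}

    @staticmethod
    def key(line):
        return tuple(line.split('|')[6:10])

    def __len__(self):
        return len(self._by_key)

    def __iter__(self):
        return iter(self._by_key.values())

    def __contains__(self, line):
        return self.key(line) in self._by_key

first = KeyedLineSet([
    "4|null|null|null|null|null|3590740374739|20077|7739662|75414741|\n",
    "1|a|a|a|a|a|10|20|30|40|\n",
])
second = KeyedLineSet([
    "4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|\n",
    "2|b|b|b|b|b|50|60|70|80|\n",
])

matched = list(first & second)         # lines whose key is in both sets
only_in_first = list(first - second)   # keys present only in `first`
only_in_second = list(second - first)  # keys present only in `second`
```

The `&` and `-` operators come from the `Set` mixin methods, which rely only on `__contains__`, `__iter__` and `__len__`. Because two lines with the same key can differ in their other columns, which file's version of a matched line is returned depends on how the mixin builds the intersection, so it is safer not to depend on it.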
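Answer 2 (score: 0)
Finally, I was able to achieve this in very little time using dictionaries: a 370 MB data file compared against a 270 MB data file, with a maximum of 50 MB (using tuples as keys). Here is the script: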
TmpDict = {}
TmpDict2 = {}

with open("fileA", 'r') as reader:
    for line in reader:
        line = line.strip()
        TmpArr = line.split('|')
        # Forming a dictionary with the columns below as a tuple key
        TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line

with open("fileB", 'r') as reader2:
    for line in reader2:
        line = line.strip()
        TmpArr = line.split('|')
        TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line

# Lines of fileA: matched if the key also exists in fileB, otherwise "not in B"
with open('MatchedRecords.txt', 'w') as outfile, open('notInB', 'w') as outfileNonMatchedB:
    for k, v in TmpDict.items():  # iteritems() in Python 2
        if k in TmpDict2:
            outfile.write(v + '\n')
        else:
            outfileNonMatchedB.write(v + '\n')

# Lines of fileB whose key never appears in fileA
with open('notInA', 'w') as outfileNonMatchedA:
    for k, v in TmpDict2.items():
        if k not in TmpDict:
            outfileNonMatchedA.write(v + '\n')
Can any improvements be made to this? Suggestions are welcome. Thanks!
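One possible improvement, offered as a sketch rather than a measured result: keep only the key tuples in memory and stream the lines themselves, instead of storing every full line in two dictionaries. The function name `compare_files` and its parameters are hypothetical; the key columns are the same zero-based indexes used in the script above. Unlike the dictionary version, this writes every line, including lines that share a key within one file, and it assumes every line has at least 15 fields.

```python
KEY_INDEXES = (2, 3, 11, 12, 13, 14)  # same columns as the script above

def key_of(line):
    """Return the key columns of one line as a tuple."""
    fields = line.rstrip('\n').split('|')
    return tuple(fields[i] for i in KEY_INDEXES)

def compare_files(path_a, path_b, matched_path, not_in_b_path, not_in_a_path):
    # Pass 1: remember only the keys of file B, not the full lines.
    with open(path_b) as fb:
        keys_b = {key_of(line) for line in fb}

    # Pass 2: stream file A, classifying each line and recording its key.
    keys_a = set()
    with open(path_a) as fa, \
         open(matched_path, 'w') as matched, \
         open(not_in_b_path, 'w') as not_in_b:
        for line in fa:
            k = key_of(line)
            keys_a.add(k)
            (matched if k in keys_b else not_in_b).write(line)

    # Pass 3: stream file B again for lines whose key never appeared in A.
    with open(path_b) as fb, open(not_in_a_path, 'w') as not_in_a:
        for line in fb:
            if key_of(line) not in keys_a:
                not_in_a.write(line)
```

A call such as `compare_files('fileA', 'fileB', 'MatchedRecords.txt', 'notInB', 'notInA')` would mirror the file names used in the script. The trade-off is one extra read of the second file in exchange for not holding the full lines in memory.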