Python: comparing large text files based on multiple keys

Asked: 2014-09-30 06:33:20

Tags: python

I have two text files, each about 1 GB, with 60 columns per line. Six of those columns are the keys to compare between the files.

Example:

File 1:   4|null|null|null|null|null|3590740374739|20077|7739662|75414741|

File 2:   4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|

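Here the two lines count as equal because columns 7, 8, 9 and 10 (the keys) are the same in both files. I tried a sample that compares the files without taking the keys into account, and it works fine, but I need to compare based on the keys rather than character by character across each line.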
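To make the key comparison concrete, here is a minimal sketch using the two sample lines above. The helper name `key_of` and the constant `KEY_COLUMNS` are hypothetical; the zero-based indexes 6 to 9 correspond to columns 7 to 10 of the example.

```python
# Zero-based indexes of the key columns (columns 7, 8, 9 and 10 in the example).
KEY_COLUMNS = (6, 7, 8, 9)

def key_of(line, columns=KEY_COLUMNS, sep='|'):
    """Return the key columns of a delimited line as a tuple (hypothetical helper)."""
    fields = line.rstrip('\n').split(sep)
    return tuple(fields[i] for i in columns)

line1 = '4|null|null|null|null|null|3590740374739|20077|7739662|75414741|'
line2 = '4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|'

# The lines differ as text but share the same key.
print(line1 == line2)                  # False
print(key_of(line1) == key_of(line2))  # True
```

Because the key is a tuple, it is hashable and can be used directly as a set member or dictionary key.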

Below is the code sample I used to compare the files without considering the keys.

with open('srcone.txt') as b:
    blines = set(b)

with open('srctwo.txt') as a:
    alines = set(a)

# Lines from the second source: missing from the first, or matched
with open('notInfirstSource.txt', 'w') as result, \
        open('matchedrecords.txt', 'w') as matched:
    for line in alines:
        if line not in blines:
            result.write(line)
        else:
            matched.write(line)

# Lines from the first source that are missing from the second
with open('notInsecondSource.txt', 'w') as non:
    for lin in blines:
        if lin not in alines:
            non.write(lin)

3 Answers:

Answer 0 (score: 0)

Here is one way to compare lines based on keys/columns, though I am not sure how efficient it is.

# Zero-based indexes of the key columns (columns 7, 8, 9 and 10)
list_of_columns_to_compare = [6, 7, 8, 9]

with open('srcone.txt') as b:
    blines = set(b)
with open('srctwo.txt') as a:
    alines = set(a)

with open('matchedrecords.txt', 'w') as matched:
    for blin in blines:
        b_fields = blin.split('|')
        # Key columns of the current line from the first file
        b_columns = [b_fields[i] for i in list_of_columns_to_compare]
        for alin in alines:
            a_fields = alin.split('|')
            # Key columns of the current line from the second file
            a_columns = [a_fields[i] for i in list_of_columns_to_compare]
            # Compare only after all key columns have been collected
            if a_columns == b_columns:
                matched.write(blin.rstrip('\n') + " = " + alin)
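The nested loop above compares every line of one file with every line of the other, so its cost grows with the product of the two line counts. A minimal sketch of an alternative, assuming the key is columns 7 to 10 as in the question: index one file by key in a dictionary, then do one lookup per line of the other file. The names `key_of`, `index_by_key` and `matching_pairs` are hypothetical.

```python
from collections import defaultdict

KEY_COLUMNS = (6, 7, 8, 9)  # assumed: zero-based indexes of columns 7 to 10

def key_of(line):
    """Return the key columns of a '|'-delimited line as a tuple."""
    fields = line.rstrip('\n').split('|')
    return tuple(fields[i] for i in KEY_COLUMNS)

def index_by_key(lines):
    """Map each key to the list of lines carrying it (duplicates are kept)."""
    index = defaultdict(list)
    for line in lines:
        index[key_of(line)].append(line)
    return index

def matching_pairs(blines, alines):
    """Yield (bline, aline) pairs that share a key, one dictionary lookup per bline."""
    index = index_by_key(alines)
    for blin in blines:
        for alin in index.get(key_of(blin), ()):
            yield blin, alin

sample_b = ['4|null|null|null|null|null|3590740374739|20077|7739662|75414741|\n']
sample_a = ['4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|\n',
            '5|null|null|null|null|null|1|2|3|4|\n']
pairs = list(matching_pairs(sample_b, sample_a))
print(len(pairs))  # 1
```

With real files, the two lists would be replaced by open file objects, since both functions only iterate over their arguments.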

Answer 1 (score: 0)

Taking a hint from the recipe for KeyedSets on ActiveState, you can build a keyed set and then simply use set intersection and set difference to produce the results:

import collections.abc

class Set(collections.abc.Set):
    """Set of lines where membership is decided by the key columns only."""
    @staticmethod
    def key(s): return tuple(s.split('|')[6:10])   # columns 7 to 10
    def __init__(self, it): self._dict = {self.key(s): s for s in it}
    def __len__(self): return len(self._dict)
    def __iter__(self): return iter(self._dict.values())
    def __contains__(self, value): return self.key(value) in self._dict

data = {}
for filename in 'srcone.txt', 'srctwo.txt':
    with open(filename) as f:
        data[filename] = Set(f)

with open('notInFirstSource.txt', 'w') as f:
    for lines in data['srctwo.txt'] - data['srcone.txt']:
        f.write(''.join(lines))

with open('notInSecondSource.txt', 'w') as f:
    for lines in data['srcone.txt'] - data['srctwo.txt']:
        f.write(''.join(lines))

with open('matchedrecords.txt', 'w') as f:
    for lines in data['srcone.txt'] & data['srctwo.txt']:
        f.write(''.join(lines))
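As a lighter-weight variant of the same idea, and assuming Python 3, dictionary key views already support the set operators, so a plain dict keyed by the key columns yields the same three result groups without a custom class. The helper `load_keyed` and the sample lines are hypothetical; as with the class above, a later line with the same key replaces an earlier one.

```python
def load_keyed(lines):
    """Build {key tuple: line}; the key is columns 7 to 10 (slice 6:10)."""
    return {tuple(line.split('|')[6:10]): line for line in lines}

first = load_keyed([
    '4|null|null|null|null|null|3590740374739|20077|7739662|75414741|\n',
    '9|null|null|null|null|null|1|2|3|4|\n',
])
second = load_keyed([
    '4|null|11|333|asdsd|null|3590740374739|20077|7739662|75414741|\n',
    '8|null|null|null|null|null|5|6|7|8|\n',
])

matched_keys = first.keys() & second.keys()    # keys present in both sources
not_in_first = second.keys() - first.keys()    # keys only in the second source
not_in_second = first.keys() - second.keys()   # keys only in the first source

# Look the full lines back up through the dictionaries.
matched_lines = [first[k] for k in matched_keys]
```

The key sets are small compared with the lines themselves, and the full line is recovered with a dictionary lookup only when it needs to be written out.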

Answer 2 (score: 0)

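In the end I was able to do this in very little time using dictionaries. That is, a 370 MB data file was compared against a 270 MB data file with a maximum of 50 MB (using tuples as keys). Here is the script: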

TmpDict = {}
TmpDict2 = {}

with open("fileA", 'r') as reader:
    for line in reader:
        line = line.strip()
        TmpArr = line.split('|')
        # Build a dictionary keyed by a tuple of the columns below
        TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line

with open("fileB", 'r') as reader2:
    for line in reader2:
        line = line.strip()
        TmpArr = line.split('|')
        TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line

# Records of fileA: matched by key in fileB, or missing from fileB
with open('MatchedRecords.txt', 'w') as outfile, \
        open('notInB', 'w') as outfileNonMatchedB:
    for k, v in TmpDict.items():
        if k in TmpDict2:
            outfile.write(v + '\n')
        else:
            outfileNonMatchedB.write(v + '\n')

# Records of fileB whose key does not occur in fileA
with open('notInA', 'w') as outfileNonMatchedA:
    for k, v in TmpDict2.items():
        if k not in TmpDict:
            outfileNonMatchedA.write(v + '\n')

Can this be improved in any way? Suggestions are welcome. Thanks!
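One possible refinement, offered as an unmeasured sketch: both dictionaries in the script hold every full line in memory, and a repeated key silently overwrites the earlier line. Keeping only the set of keys from one file and streaming the other avoids both, at the cost of reading each file twice. The names `key_of`, `load_keys` and `split_by_key` are hypothetical; the key columns are the ones used in the script.

```python
KEY_COLUMNS = (2, 3, 11, 12, 13, 14)  # same columns as the script in this answer

def key_of(line):
    """Return the key columns of a '|'-delimited line as a tuple."""
    fields = line.rstrip('\n').split('|')
    return tuple(fields[i] for i in KEY_COLUMNS)

def load_keys(path):
    """Read a file once and keep only its key tuples, not the full lines."""
    with open(path) as f:
        return {key_of(line) for line in f}

def split_by_key(src_path, other_path, matched_path, unmatched_path):
    """Stream src_path; write lines whose key occurs in other_path to
    matched_path and all remaining lines to unmatched_path."""
    other_keys = load_keys(other_path)
    with open(src_path) as src, \
         open(matched_path, 'w') as matched, \
         open(unmatched_path, 'w') as unmatched:
        for line in src:
            target = matched if key_of(line) in other_keys else unmatched
            target.write(line)
```

Under these assumptions, `split_by_key('fileA', 'fileB', 'MatchedRecords.txt', 'notInB')` followed by `split_by_key('fileB', 'fileA', os.devnull, 'notInA')` (with `import os`) would produce the same three output files, except that lines with duplicate keys are all kept rather than collapsed to the last one.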