Using Python: compare two files and print the mismatches in a formatted result

Time: 2017-12-28 09:40:17

Tags: python python-3.x python-2.7

My files are as follows:

File 1:

COL1|COL2|COL3|COL4|COL5

'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5S'|'5042449536906016501541'

'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'

'SR'|'2017-09-01 00:19:23'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'

File 2:

COL1|COL2|COL3|COL4|COL5

'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5Q'|'5042449536906016501541'

'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'

'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'

The primary key here is the 5th column (COL5).

  1. After comparing the two files, this is the output I want:

    PrimaryKey|Column|File1Value|File2Value
    '5042449536906016501541'|COL4|'1A3LA7015L5S'|'1A3LA7015L5Q'
    '5042449603146028701555'|COL2|'2017-09-01 00:19:23'|'2017-09-01 00:19:20'
    
  2. It should list every mismatch, together with the column in which it occurs, in the format shown above.

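  3. I tried the code below, but it only works when both files have the same number of rows and the only differences are cell-level mismatches. I also want to handle rows missing from the source file, rows missing from the target file, and duplicate rows within a file, and then find the mismatches among the rows common to both. Please help.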

    # Written for Python 3 (print function, text-mode CSV files).
    import sys
    import csv
    import datetime
    import time
    import os
    from operator import itemgetter

    if len(sys.argv) != 3:
        print("invalid params")
        sys.exit(1)

    # Resolve the input paths before changing directory below.
    src_path = os.path.abspath(sys.argv[1])
    tgt_path = os.path.abspath(sys.argv[2])

    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d-%H:%M:%S')
    os.makedirs(st)
    os.chdir(st)
    d = '|'    # we can change delimiter here
    rslt = open('Comp_Result', 'w')
    stgt = open('sort_tgt', 'w')
    read1 = csv.reader(open(src_path, newline=''), delimiter=d)
    read2 = csv.reader(open(tgt_path, newline=''), delimiter=d)
    sort_src = sorted(read1, key=itemgetter(0))
    sort_tgt = sorted(read2, key=itemgetter(0))
    f = open(src_path, 'r')
    reader = csv.reader(f, delimiter=d)
    num_cols = len(next(reader))  # Read first line and count columns
    f.close()
    num_lines = 0
    rslt.write('Key_col|col_num|src_value|tgt_value')
    rslt.write('\n ********************************************\n')
    # Write the sorted target rows to a temporary file and count them.
    for trg_line in sort_tgt:
        for i in range(0, num_cols):
            stgt.write(trg_line[i])
            stgt.write('|')
        stgt.write('\n')
        num_lines = num_lines + 1
    stgt.close()
    stgt_file = open('sort_tgt', 'r')
    read_tgt = csv.reader(stgt_file, delimiter=d)
    check_point = 1
    tgt_line = next(read_tgt)
    # Walk both sorted inputs in step, comparing rows whose first column matches.
    for src_line in sort_src:
        while src_line[0] >= tgt_line[0] and check_point <= num_lines:
            check_point = check_point + 1
            if src_line[0] == tgt_line[0]:
                for i in range(1, num_cols):
                    if src_line[i] != tgt_line[i]:
                        col_num = str(i + 1)
                        rslt.write(src_line[0])
                        rslt.write('|')
                        rslt.write(col_num)
                        rslt.write('|')
                        rslt.write(src_line[i])
                        rslt.write('|')
                        rslt.write(tgt_line[i])
                        rslt.write('\n')
            if check_point <= num_lines:
                tgt_line = next(read_tgt)
    rslt.close()
    stgt_file.close()

    print('\n\n****************************\n comparison done,\n**************************\n Results are in Comp_Result file at below folder:')
    print(st)
    print(' \n\n')
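
The three extra cases listed in point 3 (keys missing from the source, keys missing from the target, duplicated keys) can be separated out before any cell-level comparison. The following is only a minimal sketch using the standard library, not a tested solution for real data. It assumes COL5 is the key, that duplicated keys should be reported and then excluded from the comparison, and that blank lines can be skipped. The function names `load` and `compare` and the rows with keys `'ONLY-IN-SRC'` and `'DUP'` are hypothetical additions for illustration; the other rows are the sample data from above.

```python
import csv
import io
from collections import Counter

KEY = 4  # zero-based position of COL5, the primary key named in the question

# Sample rows from the question, plus hypothetical rows ('ONLY-IN-SRC', 'DUP')
# added only to exercise the missing-key and duplicate-key paths.
SRC = """COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5S'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:23'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
'SR'|'x'|'x'|'x'|'ONLY-IN-SRC'
"""
TGT = """COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5Q'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
'SR'|'y'|'y'|'y'|'DUP'
'SR'|'z'|'z'|'z'|'DUP'
"""


def load(text):
    """Return (header, unique rows indexed by key, set of duplicated keys)."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter='|') if r]
    header, body = rows[0], rows[1:]
    counts = Counter(r[KEY] for r in body)
    dups = {k for k, n in counts.items() if n > 1}
    by_key = {r[KEY]: r for r in body if r[KEY] not in dups}
    return header, by_key, dups


def compare(src_text, tgt_text):
    """Split the keys into duplicates, missing on either side, and mismatches."""
    header, src, src_dups = load(src_text)
    _, tgt, tgt_dups = load(tgt_text)
    dups = src_dups | tgt_dups
    missing_in_tgt = sorted(k for k in src if k not in tgt and k not in dups)
    missing_in_src = sorted(k for k in tgt if k not in src and k not in dups)
    mismatches = []
    # Only keys that are unique in both files are compared cell by cell.
    for key in sorted(src.keys() & tgt.keys()):
        for name, a, b in zip(header, src[key], tgt[key]):
            if a != b:
                mismatches.append((key, name, a, b))
    return sorted(dups), missing_in_src, missing_in_tgt, mismatches


dups, missing_in_src, missing_in_tgt, mismatches = compare(SRC, TGT)
print('PrimaryKey|Column|File1Value|File2Value')
for row in mismatches:
    print('|'.join(row))
print('Duplicate keys:', dups)
print('Missing in source:', missing_in_src)
print('Missing in target:', missing_in_tgt)
```

To read real files instead of the inline strings, `load` could take the result of `open(path, newline='')` in place of `io.StringIO(text)`. Whether a duplicated key should be skipped, as here, or compared row by row depends on what the duplicates mean in the data.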
    

1 answer:

Answer 0: (score: 0)

You can use pandas and numpy, as follows:

    import pandas as pd
    import numpy as np

    #1
    csv_1 = '48005038-1.csv'
    df1 = pd.read_csv(filepath_or_buffer=csv_1, sep='|', index_col=4)

    csv_2 = '48005038-2.csv'
    df2 = pd.read_csv(filepath_or_buffer=csv_2, sep='|', index_col=4)

    #2
    ne_stacked = (df1 != df2).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['COL5', 'col']

    #3
    diff = np.where(df1 != df2)
    changed_from = df1.values[diff]
    changed_to = df2.values[diff]

    #4
    diff = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    print(diff)

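  1. Read your CSV files as pandas DataFrames.
  2. Get the columns that differ in each row.
  3. Get the values before and after the change.
  4. Convert the diff data back into a DataFrame, using the index obtained in step #2.

The output is: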

                                                    from                     to
    COL5                     col
    '5042449536906016501541' COL4         '1A3LA7015L5S'         '1A3LA7015L5Q'
    '5042449603146028701555' COL2  '2017-09-01 00:19:23'  '2017-09-01 00:19:20'
    

I think you can easily convert this into the format you want.
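
One possible way to do that conversion is sketched below. It is not part of the original answer. It assumes a `diff` frame shaped like the printed result (a two-level index named `COL5` and `col`, with columns `from` and `to`), and rebuilds a stand-in frame from the printed values so the snippet runs on its own. The name `report` is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the `diff` frame printed above (values copied from that output).
index = pd.MultiIndex.from_tuples(
    [("'5042449536906016501541'", "COL4"),
     ("'5042449603146028701555'", "COL2")],
    names=["COL5", "col"],
)
diff = pd.DataFrame(
    {"from": ["'1A3LA7015L5S'", "'2017-09-01 00:19:23'"],
     "to": ["'1A3LA7015L5Q'", "'2017-09-01 00:19:20'"]},
    index=index,
)

# Turn the two index levels into ordinary columns, then rename all four
# columns to the header requested in the question.
report = diff.reset_index()
report.columns = ["PrimaryKey", "Column", "File1Value", "File2Value"]

# Write pipe-delimited text; an in-memory buffer is used here, but a file
# path could be passed to to_csv instead.
buf = io.StringIO()
report.to_csv(buf, sep="|", index=False)
text = buf.getvalue()
print(text)
```

With the sample data this should print the header `PrimaryKey|Column|File1Value|File2Value` followed by one line per mismatch.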