My files are as follows:
File 1:
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5S'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:23'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
File 2:
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5Q'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
The primary key here is my fifth column (COL5).
After comparing the two files, I want output like this:
PrimaryKey|Column|File1Value|File2Value
'5042449536906016501541'|COL4|'1A3LA7015L5S'|'1A3LA7015L5Q'
'5042449603146028701555'|COL2|'2017-09-01 00:19:23'|'2017-09-01 00:19:20'
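It should list each mismatch, together with the column it occurs in, in the format shown above.
I tried the code below, but it only works when both files have the same number of rows and only cell-level mismatches need to be found. I also want to handle records that are missing from the source file, records that are missing from the target file, and duplicates within a file, and then find the mismatches among the records common to both. Please help.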
import sys
import csv
import datetime
import time
import os
from operator import itemgetter

if len(sys.argv) != 3:
    print("invalid params")
    sys.exit(1)
elif len(sys.argv) == 3:
    # Create a timestamped working directory for the result files.
    # NOTE: after chdir, relative input paths no longer resolve; pass absolute paths.
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d-%H:%M:%S')
    os.makedirs(st)
    os.chdir(st)
    d = '|'  # we can change delimiter here
    rslt = open('Comp_Result', 'w')
    stgt = open('sort_tgt', 'w')
    read1 = csv.reader(open(sys.argv[1], 'r'), delimiter=d)
    read2 = csv.reader(open(sys.argv[2], 'r'), delimiter=d)
    # NOTE: both files are sorted and matched on the first column (index 0),
    # not on the fifth column described above as the primary key.
    sort_src = sorted(read1, key=itemgetter(0))
    sort_tgt = sorted(read2, key=itemgetter(0))
    f = open(sys.argv[1], 'r')
    reader = csv.reader(f, delimiter=d)
    num_cols = len(next(reader))  # Read first line and count columns
    f.seek(0)
    num_lines = 0
    rslt.write('Key_col|col_num|src_value|tgt_value')
    rslt.write('\n ********************************************\n')
    # Write the sorted target rows to a temporary file and count them.
    for trg_line in sort_tgt:
        for i in range(0, num_cols):
            stgt.write(trg_line[i])
            stgt.write('|')
        stgt.write('\n')
        num_lines = num_lines + 1
    stgt.close()
    stgt_file = open('sort_tgt', 'r')
    read_tgt = csv.reader(stgt_file, delimiter=d)
    check_point = 1
    stgt_file.seek(0)
    tgt_line = next(read_tgt)
    # Walk both sorted sequences in step and compare rows whose keys match.
    for src_line in sort_src:
        while src_line[0] >= tgt_line[0] and check_point <= num_lines:
            check_point = check_point + 1
            if src_line[0] == tgt_line[0]:
                for i in range(1, num_cols):
                    if src_line[i] != tgt_line[i]:
                        col_num = str(i + 1)
                        rslt.write(src_line[0])
                        rslt.write('|')
                        rslt.write(col_num)
                        rslt.write('|')
                        rslt.write(src_line[i])
                        rslt.write('|')
                        rslt.write(tgt_line[i])
                        rslt.write('\n')
            prev_line = tgt_line
            if check_point <= num_lines:
                tgt_line = next(read_tgt)
    rslt.close()
    print('\n\n**************************** \n comparison done, \n************************** \n Results are in Comp_Result file at below folder:')
    print(st)
    print(' \n\n')
Answer 0 (score: 0)
You can use pandas and numpy, as follows:
import pandas as pd
import numpy as np

# 1. Load both files, using the fifth column (COL5) as the index.
csv_1 = '48005038-1.csv'
df1 = pd.read_csv(filepath_or_buffer=csv_1, sep='|', index_col=4)
csv_2 = '48005038-2.csv'
df2 = pd.read_csv(filepath_or_buffer=csv_2, sep='|', index_col=4)

# 2. Build a boolean mask of the cells that differ and keep only the True entries.
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['COL5', 'col']

# 3. Pull the differing values out of each frame.
diff = np.where(df1 != df2)
changed_from = df1.values[diff]
changed_to = df2.values[diff]

# 4. Combine them into one frame indexed by (key, column name).
diff = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
print(diff)
The output is:
from to
COL5 col
'5042449536906016501541' COL4 '1A3LA7015L5S' '1A3LA7015L5Q'
'5042449603146028701555' COL2 '2017-09-01 00:19:23' '2017-09-01 00:19:20'
I think you can easily convert this into the format you want.
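The `df1 != df2` comparison above needs both frames to carry the same index and column labels, so on its own it does not cover the missing-record and duplicate cases the question asks about. Below is a minimal sketch of one way to handle those with only the standard library. The function names `load_rows` and `compare`, the keys of the returned dictionary, and the choice to keep the first row seen for a duplicated key are all assumptions made for illustration, not something taken from the question or the answer.

```python
import csv
import io
from collections import Counter


def load_rows(handle, key_col=4, delimiter="|"):
    """Return (header, rows_by_key, duplicate_keys) for one delimited file."""
    # The csv quote character stays at its default ("), so the single quotes
    # in the sample data remain part of each value.
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    counts = Counter(row[key_col] for row in rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    by_key = {}
    for row in rows:
        # Assumption: for a duplicated key, only the first row is compared.
        by_key.setdefault(row[key_col], row)
    return header, by_key, duplicates


def compare(src, tgt, key_col=4, delimiter="|"):
    """Compare two open files on key_col and report every kind of difference."""
    header, src_rows, src_dups = load_rows(src, key_col, delimiter)
    _, tgt_rows, tgt_dups = load_rows(tgt, key_col, delimiter)
    mismatches = []
    for key in sorted(set(src_rows) & set(tgt_rows)):
        # zip stops at the shorter row, so ragged rows are compared only
        # over the columns both of them have.
        for name, a, b in zip(header, src_rows[key], tgt_rows[key]):
            if a != b:
                mismatches.append((key, name, a, b))
    return {
        "missing_in_target": sorted(set(src_rows) - set(tgt_rows)),
        "missing_in_source": sorted(set(tgt_rows) - set(src_rows)),
        "duplicates_in_source": src_dups,
        "duplicates_in_target": tgt_dups,
        "mismatches": mismatches,
    }


# Sample data copied from the question.
FILE1 = """\
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5S'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:23'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
"""
FILE2 = """\
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5Q'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
"""

result = compare(io.StringIO(FILE1), io.StringIO(FILE2))
print("PrimaryKey|Column|File1Value|File2Value")
for mismatch in result["mismatches"]:
    print("|".join(mismatch))
```

For real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. The other four entries of `result` list the keys that are missing or duplicated on each side, so they can be written to separate report files if needed.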