我有两个文件,每个文件包含两列,在第一列中,这些行通常是相似的。
file1.csv
C(2)—C(1) 1.5183
C(3)—C(2) 1.49
C(3)—C(1) 1.4991
O(4)—C(3) 1.4104
H(10)—O(4) 0.964
C(2)—C(1)—C(3) 59.19
C(3)—C(1)—H(5) 118.4
file2.csv
C(2)—C(1) 1.5052
C(3)—C(2) 1.505
C(3)—C(1) 1.5037
S(4)—C(3) 1.7976
H(10)—S(4) 1.3445
C(2)—C(1)—H(6) 117.68
C(2)—C(1)—C(3) 60.3
C(3)—C(1)—H(5) 116.99
这是一个python脚本“使用itertools”,它比较file1.csv和file2.csv中的第一个colone,然后打印相似的行。
import itertools
files = ['file1.csv', 'file2.csv']
d = {}
for fi, f in enumerate(files):
fh = open(f)
for line in fh:
sl = line.split()
name = sl[0]
val = float(sl[1])
if name not in d:
d[name] = {}
if fi not in d[name]:
d[name][fi] = []
d[name][fi].append(val)
fh.close()
for name, vals in d.items():
if len(vals) == len(files):
for var in itertools.product(*vals.values()):
if max(var) - min(var) <= 20:
out = '{}\t{}'.format(name, "\t".join(map(str, var)))
print(out)
break
output.csv
C(2)-C(1) 1.5183 1.5052
C(3)-C(2) 1.49 1.505
C(3)-C(1) 1.4991 1.5037
C(2)-C(1)-C(3) 59.19 60.3
C(3)-C(1)-H(5) 118.4 116.99
但是我也没有想到打印不同的行。
我想要的输出:
similar_lines
C(2)-C(1) 1.5183 1.5052
C(3)-C(2) 1.49 1.505
C(3)-C(1) 1.4991 1.5037
C(2)-C(1)-C(3) 59.19 60.3
C(3)-C(1)-H(5) 118.4 116.99
different_lines
O(4)-C(3) 1.4104 non
H(10)-O(4) 0.964 non
S(4)-C(3) non 1.7976
H(10)-S(4) non 1.3445
C(2)-C(1)-H(6) non 117.68
答案 0 :(得分:1)
您可以使用itertools.groupby
:
import itertools, csv
file1 = [i+[True] for i in list(csv.reader(open('filename1.csv')))]
file2 = [i+[False] for i in list(csv.reader(open('filename2.csv')))]
new_data = [[a, list(b)] for a, b in itertools.groupby(sorted(file1+file2, key=lambda x:x[0]), key=lambda x:x[0])]
similar = ['{} {}'.format(a, ' '.join(h for _, h, flag in b)) for a, b in new_data if len(b) > 1]
different = ['{} {}'.format(a, 'non {}'.format(b[0][1]) if not b[0][-1] else '{} non'.format(b[0][1])) for a, b in new_data if len(b) == 1]
last_output = 'similar_lines\n{}\n\ndifferent_lines\n{}'.format('\n'.join(similar), '\n'.join(different))
输出:
similar_lines
C (2)-C(1) 1.5183 1.5052
C (2)-C(1)-C(3) 59.19 60.3
C (3)-C(1) 1.4991 1.5037
C (3)-C(1)-H(5) 118.4 116.99
C (3)-C(2) 1.49 1.505
different_lines
C (2)-C(1)-H(6) non 117.68
H (10)-O(4) 0.964 non
H (10)-S(4) non 1.3445
O (4)-C(3) 1.4104 non
S (4)-C(3) non 1.7976