Compare two columns (in two files), then print the similar and the different lines

Time: 2018-05-03 23:52:02

Tags: python csv itertools

I have two files, each containing two columns. The entries in the first column are mostly the same in both files.

file1.csv

C(2)—C(1) 1.5183
C(3)—C(2) 1.49
C(3)—C(1) 1.4991
O(4)—C(3) 1.4104
H(10)—O(4) 0.964
C(2)—C(1)—C(3) 59.19
C(3)—C(1)—H(5) 118.4

file2.csv

C(2)—C(1) 1.5052
C(3)—C(2) 1.505
C(3)—C(1) 1.5037
S(4)—C(3) 1.7976
H(10)—S(4) 1.3445
C(2)—C(1)—H(6) 117.68
C(2)—C(1)—C(3) 60.3
C(3)—C(1)—H(5) 116.99

Here is a Python script, using itertools, that compares the first column of file1.csv and file2.csv and prints the similar lines.

import itertools

files = ['file1.csv', 'file2.csv']
d = {}  # name -> {file index -> list of values}

for fi, f in enumerate(files):
    with open(f) as fh:
        for line in fh:
            sl = line.split()
            if len(sl) < 2:  # skip blank or malformed lines
                continue
            name = sl[0]
            val = float(sl[1])
            d.setdefault(name, {}).setdefault(fi, []).append(val)

for name, vals in d.items():
    # keep only names that occur in every file
    if len(vals) == len(files):
        for var in itertools.product(*vals.values()):
            # print the first combination whose values differ by at most 20
            if max(var) - min(var) <= 20:
                out = '{}\t{}'.format(name, "\t".join(map(str, var)))
                print(out)
                break

output.csv

C(2)-C(1)        1.5183   1.5052
C(3)-C(2)        1.49     1.505
C(3)-C(1)        1.4991   1.5037
C(2)-C(1)-C(3)   59.19    60.3
C(3)-C(1)-H(5)   118.4    116.99

But I could not figure out how to print the different lines as well.

Desired output:

similar_lines
C(2)-C(1)           1.5183    1.5052
C(3)-C(2)           1.49      1.505
C(3)-C(1)           1.4991    1.5037
C(2)-C(1)-C(3)      59.19     60.3
C(3)-C(1)-H(5)      118.4     116.99

different_lines
O(4)-C(3)           1.4104      non
H(10)-O(4)          0.964       non
S(4)-C(3)            non       1.7976
H(10)-S(4)           non       1.3445
C(2)-C(1)-H(6)       non       117.68

1 answer:

Answer 0 (score: 1)

You can use itertools.groupby:

import csv
import itertools

def read_rows(path, flag):
    # The files are single-space separated; skip blank or malformed rows and tag each row with its source
    with open(path, newline='') as fh:
        return [row + [flag] for row in csv.reader(fh, delimiter=' ') if len(row) == 2]

# groupby only merges adjacent items, so sort by the first column before grouping
rows = sorted(read_rows('file1.csv', True) + read_rows('file2.csv', False), key=lambda x: x[0])
new_data = [[a, list(b)] for a, b in itertools.groupby(rows, key=lambda x: x[0])]
similar = ['{} {}'.format(a, '     '.join(h for _, h, flag in b)) for a, b in new_data if len(b) > 1]
different = ['{} {}'.format(a, 'non  {}'.format(b[0][1]) if not b[0][-1] else '{}     non'.format(b[0][1])) for a, b in new_data if len(b) == 1]
print('similar_lines\n{}\n\ndifferent_lines\n{}'.format('\n'.join(similar), '\n'.join(different)))

Output:

similar_lines
C(2)-C(1) 1.5183     1.5052
C(2)-C(1)-C(3) 59.19     60.3
C(3)-C(1) 1.4991     1.5037
C(3)-C(1)-H(5) 118.4     116.99
C(3)-C(2) 1.49     1.505

different_lines
C(2)-C(1)-H(6) non  117.68
H(10)-O(4) 0.964     non
H(10)-S(4) non  1.3445
O(4)-C(3) 1.4104     non
S(4)-C(3) non  1.7976
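
The output above is sorted by label and is not column-aligned, whereas the desired output in the question keeps the input order and pads the columns. A possible alternative is a minimal sketch with plain dictionaries. It assumes Python 3.7+ (where dicts keep insertion order), that each label occurs at most once per file (a repeated label keeps its last value), and that a label counts as "similar" when it appears in both files. The helper names `read_pairs`, `compare`, and `format_rows` are hypothetical, and the question's `<= 20` tolerance check is not applied here.

```python
def read_pairs(lines):
    """Map each label (first column) to its value (second column), kept as text."""
    pairs = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            pairs[parts[0]] = parts[1]
    return pairs


def compare(first, second, missing="non"):
    """Return (similar, different) lists of (label, value1, value2) tuples."""
    similar, different = [], []
    # Labels of the first file in order, then labels found only in the second file
    labels = list(first) + [k for k in second if k not in first]
    for label in labels:
        row = (label, first.get(label, missing), second.get(label, missing))
        if label in first and label in second:
            similar.append(row)
        else:
            different.append(row)
    return similar, different


def format_rows(rows):
    """Left-align the label and the first value so the columns line up."""
    return "\n".join("{:<20}{:<10}{}".format(*row) for row in rows)


# Small inline sample standing in for file1.csv and file2.csv
file1_lines = ["C(2)-C(1) 1.5183", "O(4)-C(3) 1.4104"]
file2_lines = ["C(2)-C(1) 1.5052", "S(4)-C(3) 1.7976"]

similar, different = compare(read_pairs(file1_lines), read_pairs(file2_lines))
print("similar_lines")
print(format_rows(similar))
print()
print("different_lines")
print(format_rows(different))
```

To run it on the real files, an open file object can be passed instead of the sample lists, for example `with open('file1.csv') as fh: first = read_pairs(fh)`, since `read_pairs` only iterates over lines.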