Python: compare CSVs and find the differences

Time: 2018-05-10 18:27:07

Tags: python csv comparison

I have two CSVs:

CSV1

id, count
1, 5
2, 10
100, 1

CSV2

id, count
100, 5
1, 10
2, 1

I need to match the rows of the two CSVs by id, take the count from each file, and compute the difference. My expected result is:

id, Diff
1, -5
100, -4
2, 9

Right now I am using nested loops:

import csv

with open('csv1.csv', 'r') as t1, open('csv2.csv', 'r') as t2:
    fileone = csv.DictReader(t1)
    filetwo = csv.DictReader(t2)
    csv1 = list(fileone)
    csv2 = list(filetwo)

# writer is a csv.writer for the output file, created elsewhere (not shown)
for data in csv1:
    for datum in csv2:
        if data['id'] == datum['id']:
            diff = int(data['count']) - int(datum['count'])

            if diff > 0:
                print(diff)
                item = [[
                    str(data['id']),
                    str(data['count']),
                    str(datum['count']),
                    str(diff)]]
                writer.writerows(item)

But because the code above runs one loop inside another, it is O(n^2), so large files take forever. Is there an easier way to do this comparison in Python?

3 Answers:

Answer 0: (score: 1)

The O(n**2) code:

fileone = csv.DictReader(t1)
filetwo = csv.DictReader(t2)
csv1 = list(fileone)
csv2 = list(filetwo)
for data in csv1:
    for datum in csv2:
        if data['id'] == datum['id']:
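            diff = int(data['count']) - int(datum['count'])
            ...

can be replaced by building two dictionaries keyed on the id field, taking the intersection of their keys, and then looping over the shared keys: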

csv1 = {data["id"]:data for data in fileone}
csv2 = {data["id"]:data for data in filetwo}
keys = csv1.keys() & csv2.keys()  # ids present in both files
for k in keys:
    data = csv1[k]
    datum = csv2[k]
    diff = int(data['count']) - int(datum['count'])
    ...

Now your complexity is roughly O(n), since a dict lookup is O(1) on average.
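
As a minimal, self-contained sketch of the dictionary approach, the sample data from the question is inlined here through io.StringIO instead of being read from the real files. The helper name diff_counts is made up for illustration, and skipinitialspace=True assumes the files really have a space after each comma, as in the question's listing.

```python
import csv
import io

# Sample data from the question, inlined so the sketch runs without files.
CSV1 = "id, count\n1, 5\n2, 10\n100, 1\n"
CSV2 = "id, count\n100, 5\n1, 10\n2, 1\n"


def diff_counts(file1, file2):
    """Return {id: count1 - count2} for ids present in both inputs.

    diff_counts is a hypothetical helper name, not part of the csv module.
    """
    # skipinitialspace=True drops the blank after each comma ("id, count").
    rows1 = {row["id"]: row for row in csv.DictReader(file1, skipinitialspace=True)}
    rows2 = {row["id"]: row for row in csv.DictReader(file2, skipinitialspace=True)}
    # Dict key views support set operations, so & yields the shared ids.
    common = rows1.keys() & rows2.keys()
    return {k: int(rows1[k]["count"]) - int(rows2[k]["count"]) for k in common}


result = diff_counts(io.StringIO(CSV1), io.StringIO(CSV2))
for key in sorted(result, key=int):
    print(key, result[key])
```

Each file is read once and every match is a hash lookup, so ids that appear in only one file are simply skipped rather than compared against every row.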

Answer 1: (score: 1)

Try pandas:

import pandas as pd
df1 = pd.read_csv('csv1.csv', index_col='id')
df2 = pd.read_csv('csv2.csv', index_col='id')
df_diff = df1-df2
print(df1)
print(df2)
print(df_diff)

Output:

      count
id         
1         5
2        10
100       1
      count
id         
100       5
1        10
2         1
      count
id         
1        -5
2         9
100      -4

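Pandas handles the index alignment (on id) for you and uses compiled numpy routines, which makes the computation much faster.

Answer 2: (score: 0)

If you want to try Pandas: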

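import pandas as pd
# header=0 replaces each file's own header row with the names given here
df1 = pd.read_csv('csv1.csv', header=0, names=['id', 'count_1'])
df2 = pd.read_csv('csv2.csv', header=0, names=['id', 'count_2'])

# inner join on id, then subtract using the merged (id-aligned) columns
df_merged = df1.merge(df2, on='id')
df_merged['diff'] = df_merged.count_1 - df_merged.count_2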