I have two CSV files:
CSV1
id, count
1, 5
2, 10
100, 1
CSV2
id, count
100, 5
1, 10
2, 1
I need to match the rows of the two CSVs by id and compute the difference between their counts. My expected result is:
id, Diff
1, -5
100, -4
2, 9
Right now I am using nested loops:
import csv

# `writer` is a csv.writer created elsewhere (not shown in the question)
with open('csv1.csv', 'r') as t1, open('csv2.csv', 'r') as t2:
    fileone = csv.DictReader(t1)
    filetwo = csv.DictReader(t2)
    csv1 = list(fileone)
    csv2 = list(filetwo)
    # every row of csv1 is compared with every row of csv2
    for data in csv1:
        for datum in csv2:
            if data['id'] == datum['id']:
                diff = int(data['count']) - int(datum['count'])
                if diff > 0:
                    print(diff)
                    item = [[
                        str(data['id']),
                        str(data['count']),
                        str(datum['count']),
                        str(diff)]]
                    writer.writerows(item)
But because the code above runs one loop inside another, it is O(n^2) and takes forever on large files. Is there a simple way to do this comparison in Python?
Answer 0 (score: 1)
This O(n**2) code:
fileone = csv.DictReader(t1)
filetwo = csv.DictReader(t2)
csv1 = list(fileone)
csv2 = list(filetwo)
for data in csv1:
    for datum in csv2:
        if data['id'] == datum['id']:
            diff = int(data['count']) - int(datum['count'])
            ...
can be replaced by building two dictionaries keyed on the id field and intersecting their keys. Then loop over the common keys:
csv1 = {data["id"]: data for data in fileone}
csv2 = {data["id"]: data for data in filetwo}
# dict key views support set intersection directly
keys = csv1.keys() & csv2.keys()
for k in keys:
    data = csv1[k]
    datum = csv2[k]
    diff = int(data['count']) - int(datum['count'])
    ...
Your complexity is now roughly O(n), since dict lookups are O(1) on average.
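To make the dictionary approach concrete, here is a minimal end-to-end sketch. It is one possible arrangement, not the answerer's exact code: the helper names `read_counts` and `diff_counts` are hypothetical, the in-memory strings stand in for csv1.csv and csv2.csv, and `skipinitialspace=True` is an assumption added because the sample files have a space after each comma (without it the header would be read as " count").

```python
import csv
import io


def read_counts(handle):
    """Map each id to its integer count."""
    # skipinitialspace strips the blank after each comma ("id, count")
    reader = csv.DictReader(handle, skipinitialspace=True)
    return {row["id"]: int(row["count"]) for row in reader}


def diff_counts(handle1, handle2):
    """Return {id: count1 - count2} for ids present in both inputs."""
    counts1 = read_counts(handle1)
    counts2 = read_counts(handle2)
    common = counts1.keys() & counts2.keys()  # O(n) set intersection
    return {key: counts1[key] - counts2[key] for key in common}


# In-memory stand-ins for the two files from the question.
csv1_text = "id, count\n1, 5\n2, 10\n100, 1\n"
csv2_text = "id, count\n100, 5\n1, 10\n2, 1\n"

diffs = diff_counts(io.StringIO(csv1_text), io.StringIO(csv2_text))

# Write the result as "id,Diff" rows, ordered by numeric id.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "Diff"])
for key in sorted(diffs, key=int):
    writer.writerow([key, diffs[key]])
print(out.getvalue())
```

With real files, the `io.StringIO(...)` objects would be replaced by handles from `open('csv1.csv', newline='')` and `open('csv2.csv', newline='')`. Ids that appear in only one file are silently skipped here, which matches the behaviour of the original nested loop.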
Answer 1 (score: 1)
Try pandas:
import pandas as pd
df1 = pd.read_csv('csv1.csv', index_col='id')
df2 = pd.read_csv('csv2.csv', index_col='id')
df_diff = df1-df2
print(df1)
print(df2)
print(df_diff)
Output:
     count
id
1        5
2       10
100      1
     count
id
100      5
1       10
2        1
     count
id
1       -5
2        9
100     -4
Pandas takes care of index alignment (on id) for you and uses compiled numpy routines, so the computation is much faster.
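One consequence of index alignment is worth spelling out: if an id exists in only one of the files, plain subtraction yields NaN for that row. The sketch below shows two possible ways to handle that. The data is hypothetical (id 3 only in the first file, id 4 only in the second) and has no spaces after the commas, so the column is named exactly "count".

```python
import io

import pandas as pd

# Hypothetical data: id 3 appears only in the first file, id 4 only in the second.
csv1_text = "id,count\n1,5\n2,10\n3,7\n"
csv2_text = "id,count\n1,10\n2,1\n4,2\n"

df1 = pd.read_csv(io.StringIO(csv1_text), index_col="id")
df2 = pd.read_csv(io.StringIO(csv2_text), index_col="id")

# Plain subtraction aligns on id; ids missing from either side become NaN.
raw = df1 - df2

# Option 1: keep only the ids present in both files.
both = raw.dropna().astype(int)

# Option 2: treat a missing count as 0 before subtracting.
filled = df1.sub(df2, fill_value=0).astype(int)

print(raw)
print(both)
print(filled)
```

Which option is appropriate depends on what a missing id means in your data. Dropping the rows mirrors the nested-loop version from the question, which only ever reports ids found in both files.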
Answer 2 (score: 0)
If you want to try Pandas:
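import pandas as pd
# header=0 replaces the file's own header row with the names given here
df1 = pd.read_csv('csv1.csv', header=0, names=['id', 'count_1'])
df2 = pd.read_csv('csv2.csv', header=0, names=['id', 'count_2'])
df_merged = df1.merge(df2, on='id')
# subtract within the merged frame so the rows are already matched by id
df_merged['diff'] = df_merged['count_1'] - df_merged['count_2']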