I need to compare two Excel sheets and sum the count values of all rows that share the same key.
Example sheets:
sheet 1          | sheet 2
index  id  count | index  id  name
1      a   12    | 1      a   qg1
2      b   15    | 2      c   ff2
3      c   21    | 3      f   dv1
4      b   5     | 4      b   bm5
.                | .
.                | .
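In the case above, I take the IDs from sheet 2 and, for each one, sum the count values of the rows in sheet 1 that have the same ID. (id a | 100, id b | 20 ...)
The code below takes far too long, because it looks up every ID by index in a nested loop.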
import pandas as pd
import csv

pcode_quantity = pd.read_csv('/1.csv', delimiter=',')
product_info = pd.read_csv('/2.csv', delimiter=',')
product_list = product_info.id.tolist()
purchase_id = pcode_quantity.id.tolist()
purchase_count = pcode_quantity['count'].tolist()
product_sum = 0
i = 0
i2 = 0
product_lenth = len(product_list)
purchase_lenth = len(purchase_id)
dict_pcode = {}
# For every product ID, rescan the whole purchase list (quadratic time).
while product_lenth > i:
    while purchase_lenth > i2:
        if product_list[i] == purchase_id[i2]:
            product_sum = product_sum + purchase_count[i2]
        i2 = i2 + 1
    dict_pcode[product_list[i]] = product_sum
    # Reset the running total and the inner counter for the next product.
    product_sum = 0
    i2 = 0
    i = i + 1
sum_pcode = pd.DataFrame(list(dict_pcode.items()))
sum_pcode.to_csv('/output.csv')
Is there any code that would speed this up?
Answer 0 (score: 1)
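You can first aggregate with groupby and sum, then join the result to product_info, replace any missing values with DataFrame.fillna, and finally build the dictionary with set_index, converting to integers with astype before the final to_dict: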
pcode_quantity = pcode_quantity.groupby('id')['count'].sum()
df = product_info.join(pcode_quantity, on='id').fillna({'count': 0})
print(df)
      id name  count
index
1      a  qg1   12.0
2      c  ff2   21.0
3      f  dv1    0.0
4      b  bm5   20.0
dict_pcode = df.set_index('id')['count'].astype(int).to_dict()
print(dict_pcode)
{'a': 12, 'c': 21, 'f': 0, 'b': 20}
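For anyone who wants to try this without the CSV files, here is a minimal self-contained sketch built on the sample rows from the question. The in-memory DataFrames are stand-ins for '/1.csv' and '/2.csv', and the names totals and sum_pcode are only illustrative. It also uses a slightly different last step than the code above: Series.reindex with fill_value=0 instead of join plus fillna, which should keep the counts as integers so that astype(int) is not needed.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for '/1.csv' and '/2.csv',
# filled with the sample rows shown in the question.
pcode_quantity = pd.DataFrame(
    {"id": ["a", "b", "c", "b"], "count": [12, 15, 21, 5]}
)
product_info = pd.DataFrame(
    {"id": ["a", "c", "f", "b"], "name": ["qg1", "ff2", "dv1", "bm5"]}
)

# One pass over the purchases: total count per id.
totals = pcode_quantity.groupby("id")["count"].sum()

# Align the totals to the ids listed in product_info.
# Ids with no purchases (here 'f') get 0, and the integer dtype is kept.
sum_pcode = totals.reindex(product_info["id"].unique(), fill_value=0)

dict_pcode = sum_pcode.to_dict()
print(dict_pcode)
```

The grouping touches each purchase row once, instead of rescanning the whole purchase list for every product ID as the nested while loops do. If a CSV file is still needed, sum_pcode.to_csv(...) can write the result directly.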