I need to reproduce in pandas what SQL does so easily:
select
del_month
, sum(case when off0_on1 = 1 then 1 else 0 end) as on1
, sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month
Here is an illustrative sample pandas DataFrame to work with:
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})
Here is my attempt to reproduce the above SQL in pandas. The first line works. The second line raises an error:
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))
This is the error from the second line:
TypeError: 'function' object is not iterable
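This previous question of mine had a problem with the lambda function, which has been solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I am looking for a general solution, since I need to do this sort of thing often. The answers to my previous question suggested using map() inside the lambda function, but the following results for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for every row of a group (i.e. for each "del_month").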
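Answer 0: (score: 4)
Just sum over the conditional expression. The comparison yields a boolean Series, and True counts as 1 when summed: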
import pandas as pd
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2],
'off0_on1':[0,0,1,1,0,1,1,1]})
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))
print(a1)
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1
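A vectorized variant of the same idea, offered as a sketch rather than as the answerer's method: build the 0/1 indicator first, then group it by the key column and pass the string alias 'sum' to transform, so no Python lambda runs per group. The cast to int is an assumption made here to keep the result dtype predictable across pandas versions.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

# One 0/1 indicator per condition (the "case when" part) ...
is_on = a1['off0_on1'].eq(1).astype(int)
is_off = a1['off0_on1'].eq(0).astype(int)

# ... summed within each del_month and broadcast back to every row.
a1['on1'] = is_on.groupby(a1['del_month']).transform('sum')
a1['off0'] = is_off.groupby(a1['del_month']).transform('sum')

print(a1)
```

Any other "case when" condition can be swapped in by changing the expression that builds the indicator.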
Similarly, you can do the same in SQL if your dialect supports it (that is, if it evaluates a boolean expression to 0 or 1):
select
del_month
, sum(off0_on1 = 1) as on1
, sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
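SQLite is one dialect that accepts this form. The following is a minimal sketch using Python's standard sqlite3 module and an in-memory database (the table is rebuilt here from the question's sample data) to check that the boolean-sum query returns the same rows as the "sum(case when)" query:

```python
import sqlite3

# Same sample rows as the question's DataFrame: (del_month, off0_on1)
rows = [(1, 0), (1, 0), (1, 1), (1, 1), (2, 0), (2, 1), (2, 1), (2, 1)]

conn = sqlite3.connect(':memory:')
conn.execute('create table a1 (del_month integer, off0_on1 integer)')
conn.executemany('insert into a1 values (?, ?)', rows)

case_when = conn.execute("""
    select del_month
         , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
         , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
    from a1
    group by del_month
    order by del_month
""").fetchall()

boolean_sum = conn.execute("""
    select del_month
         , sum(off0_on1 = 1) as on1
         , sum(off0_on1 = 0) as off0
    from a1
    group by del_month
    order by del_month
""").fetchall()
conn.close()

print(boolean_sum)  # [(1, 2, 2), (2, 3, 1)]
```

Dialects with a strict boolean type may reject sum() over a comparison, in which case the "case when" form remains the portable choice.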
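To replicate the SQL above in pandas, which returns one row per group, don't use transform. Instead, compute several aggregations in a single groupby().apply() call:
def aggfunc(x):
    # x is the sub-DataFrame for one del_month group
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}
    return pd.Series(data)

g = a1.groupby('del_month').apply(aggfunc)
print(g)
#            on1  off0
# del_month
# 1            2     2
# 2            3     1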
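As a general pattern for "sum(case when ...)", one option (a sketch, not part of the original answer) is to materialize each CASE expression as a 0/1 column with assign and then let a plain groupby().sum() do the aggregation. This avoids a Python-level function call per group:

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

g = (a1.assign(on1=a1['off0_on1'].eq(1).astype(int),    # case when off0_on1 = 1
               off0=a1['off0_on1'].eq(0).astype(int))   # case when off0_on1 = 0
       .groupby('del_month')[['on1', 'off0']]
       .sum())

print(g)
```

Each extra CASE expression becomes one more keyword argument to assign and one more name in the column list.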
Answer 1: (score: 2)
Using get_dummies is simpler, since it needs only a single groupby call.
# df here is the DataFrame from the question (named a1 there)
v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform('sum')
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)
df
   del_month  off0  on1
0          1     2    2
1          1     2    2
2          1     2    2
3          1     2    2
4          2     1    3
5          2     1    3
6          2     1    3
7          2     1    3
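Additionally, for the aggregation case, call sum directly instead of using apply:
# Start from a fresh copy of df: pop() in the previous snippet removed 'off0_on1'
(pd.get_dummies(df.pop('off0_on1'))
   .groupby(df.del_month)
   .sum()
   .rename({0: 'off0', 1: 'on1'}, axis=1))
           off0  on1
del_month
1             2    2
2             1    3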
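When every condition is an equality test on the same column, as it is here, pd.crosstab gives the same per-group counts in one call. This is a sketch of an alternative rather than something proposed in the answers above, and it only counts rows; it does not cover summing arbitrary values.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

# Rows: del_month values; columns: distinct off0_on1 values; cells: row counts.
counts = (pd.crosstab(a1['del_month'], a1['off0_on1'])
            .rename(columns={0: 'off0', 1: 'on1'}))

print(counts)
```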