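I need to group by each column separately to compute some metrics. Suppose I have a bunch of feature columns and one binary target column. Each feature value is a bin (a string). The target is an integer column; here, for simplicity, it contains only 1 and 0.
Example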
import pandas as pd
var1 = ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4', 'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5']
var2 = ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3', 'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5']
var3 = ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5']
var4 = ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4']
target = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
df = pd.DataFrame({
    'var1': var1,
    'var2': var2,
    'var3': var3,
    'var4': var4,
    'target': target
})
print(df)
cols = ['var1', 'var2', 'var3', 'var4']
# Need a groupby for each column separately:
# for each column, group by its categorical values, sum the target,
# and also count how many zeros the target has in each group.
for col in cols:
    x = df.groupby(col)[['target']].sum()  # expecting aggregated metrics
    print(x)
What I expect as the result is a DataFrame of DataFrames (or some better structure), which I can convey visually like this:
Result representation
var1 | var2 ...
---------------------------- |
bin | sum | total_zeros |
----------------- |
var1_bin1 | 1 | 0 |
var1_bin2 | 0 | 2 |
var1_bin3 | 1 | 0 |
var1_bin4 | 2 | 1 |
var1_bin5 | 0 | 3 |
Answer 0 (score: 3)
pandas answer
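We can achieve this by first iterating over all columns with for col in df.columns, using DataFrame.columns.
Then we GroupBy on each of those columns and use GroupBy.agg. In this aggregation we take the sum of target and the total number of zeros.
Finally, we use pd.concat with axis=1 to place each group's result side by side. Because the features have different numbers of bins, the shorter results are padded with NaN.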
dfg = pd.concat([
    (df.groupby(col)['target']
       # each tuple is (output column name, aggregation)
       .agg([(f'sum_{col}', 'sum'), (f'total_zeros_{col}', lambda x: x.eq(0).sum())])
       .reset_index()
    ) for col in df.columns
], axis=1)
print(dfg)
var1 sum_var1 total_zeros_var1 var2 sum_var2 total_zeros_var2 var3 sum_var3 total_zeros_var3 var4 sum_var4 total_zeros_var4 target sum_target total_zeros_target
0 var1_bin1 1 0 var2_bin1 1 1 var3_bin2 1.00 2.00 var4_bin1 1.00 1.00 0.00 0.00 6.00
1 var1_bin2 0 2 var2_bin2 0 1 var3_bin3 3.00 1.00 var4_bin2 1.00 1.00 1.00 4.00 0.00
2 var1_bin3 1 0 var2_bin3 2 0 var3_bin4 0.00 1.00 var4_bin4 2.00 4.00 nan nan nan
3 var1_bin4 2 1 var2_bin4 1 1 var3_bin5 0.00 2.00 NaN nan nan nan nan nan
4 var1_bin5 0 3 var2_bin5 0 3 NaN nan nan NaN nan nan nan nan nan
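If the goal is closer to the "DataFrame of DataFrames" layout from the question, one option is to let pd.concat build two-level column headers by passing it a dict keyed by feature name. The following is only a sketch under some assumptions: the names features, per_feature and result are made up for illustration, the target column is left out of the loop, and df is rebuilt here with var4 included so the snippet runs on its own.

```python
import pandas as pd

# Same data as in the question, with var4 included.
df = pd.DataFrame({
    'var1': ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4',
             'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5'],
    'var2': ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3',
             'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5'],
    'var3': ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3',
             'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5'],
    'var4': ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4',
             'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4'],
    'target': [1, 0, 0, 1, 1, 1, 0, 0, 0, 0],
})

# Hypothetical helper name: every column except the target is a feature.
features = [c for c in df.columns if c != 'target']

per_feature = {}
for col in features:
    # Named aggregation: keyword = output column name, value = aggregation.
    g = df.groupby(col)['target'].agg(
        sum='sum',
        total_zeros=lambda s: s.eq(0).sum(),
    )
    g.index.name = 'bin'            # same inner header for every feature
    per_feature[col] = g.reset_index()

# The dict keys become the top level of the column MultiIndex.
result = pd.concat(per_feature, axis=1)

# One block per feature can then be pulled out by name.
print(result['var1'])
```

With this layout, result['var1'] is an ordinary DataFrame with the columns bin, sum and total_zeros, so the NaN padding only shows up when the whole result is viewed at once.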
Answer 1 (score: 0)
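If performance matters, flag the 0 values of target once before the groupby instead of counting them inside each group. Both columns can then be aggregated with a plain sum: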
df1 = pd.concat([
    # flag the zeros of the target (not of the grouping column) up front
    (df.assign(total_zeros=df['target'].eq(0).astype(int))
       .groupby(col)[['target', 'total_zeros']]
       .sum()
       .add_suffix(f'_{col}')
       .reset_index()
    ) for col in df.columns
], axis=1)
print(df1)
        var1  target_var1  total_zeros_var1       var2  target_var2  \
0  var1_bin1            1                 0  var2_bin1            1
1  var1_bin2            0                 2  var2_bin2            0
2  var1_bin3            1                 0  var2_bin3            2
3  var1_bin4            2                 1  var2_bin4            1
4  var1_bin5            0                 3  var2_bin5            0
   total_zeros_var2       var3  target_var3  total_zeros_var3       var4  \
0                 1  var3_bin2          1.0               2.0  var4_bin1
1                 1  var3_bin3          3.0               1.0  var4_bin2
2                 0  var3_bin4          0.0               1.0  var4_bin4
3                 1  var3_bin5          0.0               2.0        NaN
4                 3        NaN          NaN               NaN        NaN
   target_var4  total_zeros_var4  target  target_target  total_zeros_target
0          1.0               1.0     0.0            0.0                 6.0
1          1.0               1.0     1.0            4.0                 0.0
2          2.0               4.0     NaN            NaN                 NaN
3          NaN               NaN     NaN            NaN                 NaN
4          NaN               NaN     NaN            NaN                 NaN
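As an alternative to placing the per-column results side by side, the same "flag the zeros first, then sum" idea can be applied to a long-format table, which avoids the NaN padding altogether. This is a sketch, not something taken from the answers above: it assumes DataFrame.melt is acceptable here, and the names long_df, is_zero and summary are made up for illustration.

```python
import pandas as pd

# Same data as in the question, with var4 included.
df = pd.DataFrame({
    'var1': ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4',
             'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5'],
    'var2': ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3',
             'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5'],
    'var3': ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3',
             'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5'],
    'var4': ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4',
             'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4'],
    'target': [1, 0, 0, 1, 1, 1, 0, 0, 0, 0],
})

# One row per (original row, feature): columns are target, feature, bin.
long_df = df.melt(id_vars='target', var_name='feature', value_name='bin')

# Flag zeros once, before grouping, so a plain sum counts them.
long_df['is_zero'] = long_df['target'].eq(0).astype(int)

summary = (
    long_df.groupby(['feature', 'bin'])[['target', 'is_zero']]
           .sum()
           .rename(columns={'target': 'sum', 'is_zero': 'total_zeros'})
)

# Rows for a single feature, indexed by bin.
print(summary.loc['var1'])
```

Here summary has one row per (feature, bin) pair, so summary.loc['var1'] returns just the bins of var1 with their sum and total_zeros.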