Proportion of one column based on another column in pandas

Time: 2017-05-22 05:46:15

Tags: python pandas

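Technically this should be a simple thing, but unfortunately it escapes me at the moment.

I am trying to find the proportion of one column based on another column. For example: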

Column 1   |  target_variable
'potato'         1
'potato'         0
'tomato'         1
'brocolli'       1
'tomato'         0

The expected output would be:

column 1   | target = 1  | target = 0 | total_count
'potato'   |     1       |      1     |     2
'tomato'   |     1       |      1     |     2
'brocolli' |     1       |      0     |     1

However, I think I am using aggregation incorrectly, so I resorted to the following naive implementation:

z = {}
for i in train.index:
    fruit = train["fruit"][i]
    l = train["target"][i]
    if fruit not in z:
        if l == 1:
            z[fruit] = {1:1,0:0,'count':1}
        else:
            z[fruit] = {1:0,0:1,'count':1}
    else:
        if l == 1:
            z[fruit][1] += 1
        else:
            z[fruit][0] += 1
        z[fruit]['count'] += 1

This gives a similar output, but as a dictionary.

Can anyone enlighten me on the correct syntax for doing this the pandas way? :)

Thanks! :)

2 answers:

Answer 0: (score: 4)

You need groupby + size + unstack + add_prefix + sum:

df1 = df.groupby(['Column 1','target_variable']).size() \
        .unstack(fill_value=0) \
        .add_prefix('target = ')
df1['total_count'] = df1.sum(axis=1)
print (df1)
target_variable  target = 0  target = 1  total_count
Column 1                                            
brocolli                  0           1            1
potato                    1           1            2
tomato                    1           1            2
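
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same chain. The sample DataFrame is an assumption: it is rebuilt from the table in the question, using the column names `Column 1` and `target_variable` and labels without the surrounding quotes.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed, not the asker's real data).
df = pd.DataFrame({
    "Column 1": ["potato", "potato", "tomato", "brocolli", "tomato"],
    "target_variable": [1, 0, 1, 1, 0],
})

# Count rows per (Column 1, target_variable) pair, then pivot the target
# values into columns; pairs that never occur are filled with 0.
counts = (
    df.groupby(["Column 1", "target_variable"])
      .size()
      .unstack(fill_value=0)
      .add_prefix("target = ")
)

# Row-wise sum of the per-target counts gives the total per group.
counts["total_count"] = counts.sum(axis=1)
print(counts)
```

The `fill_value=0` argument is what keeps `brocolli` at 0 for `target = 0` instead of NaN.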

Alternatively, use crosstab:

df1 = pd.crosstab(df['Column 1'],df['target_variable'], margins=True)
print (df1)
target_variable  0  1  All
Column 1                  
brocolli         0  1    1
potato           1  1    2
tomato           1  1    2
All              2  3    5

df1 = df1.rename(columns = {'All': 'total_count'}).iloc[:-1]
print (df1)
target_variable  0  1  total_count
Column 1                          
brocolli         0  1            1
potato           1  1            2
tomato           1  1            2
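
Since the question asks about proportions rather than raw counts, `crosstab` can probably cover that too through its `normalize` parameter. The sketch below is an addition, not part of the original answer, and again assumes the sample data from the question's table.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed).
df = pd.DataFrame({
    "Column 1": ["potato", "potato", "tomato", "brocolli", "tomato"],
    "target_variable": [1, 0, 1, 1, 0],
})

# normalize="index" divides each row by its row total, so every row
# holds the share of target = 0 and target = 1 within that group.
proportions = pd.crosstab(df["Column 1"], df["target_variable"], normalize="index")
print(proportions)
```

With this data, each of `potato` and `tomato` is split evenly between the two target values, while `brocolli` is entirely target 1.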

Answer 1: (score: 1)

Let's use get_dummies, add_prefix and groupby:

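# One indicator column per target value, named 'target = 0' / 'target = 1'
df = df.assign(**df['target_variable'].astype(str).str.get_dummies().add_prefix('target = '))
# Sum only the indicator columns; summing across the string column 'Column 1' can fail in recent pandas versions
df['total_count'] = df.filter(like='target = ').sum(axis=1)
df.groupby('Column 1').sum()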

Output:

            target_variable  target = 0  target = 1  total_count
Column 1                                                        
'brocolli'                1           0           1            1
'potato'                  1           1           1            2
'tomato'                  1           1           1            2
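
Note that the output above also carries a summed `target_variable` column, which was not part of the requested result. A possible variant, sketched here as an assumption rather than as the answerer's method, builds the indicator columns with `pd.get_dummies` separately and groups only those, so the extra column never appears. The sample data is rebuilt from the question's table.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed).
df = pd.DataFrame({
    "Column 1": ["potato", "potato", "tomato", "brocolli", "tomato"],
    "target_variable": [1, 0, 1, 1, 0],
})

# Indicator columns named 'target = 0' and 'target = 1'; dtype=int keeps
# them as integers so the grouped sums are plain counts.
dummies = pd.get_dummies(
    df["target_variable"], prefix="target", prefix_sep=" = ", dtype=int
)

# Group the indicator frame by the label column and add a row total.
result = dummies.groupby(df["Column 1"]).sum()
result["total_count"] = result.sum(axis=1)
print(result)
```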