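Technically this should be a simple thing, but unfortunately I can't work it out at the moment.
I am trying to find the proportion of one column's values based on another column. For example: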
Column 1 | target_variable
'potato' | 1
'potato' | 0
'tomato' | 1
'brocolli' | 1
'tomato' | 0
The expected output would be:
column 1 | target = 1 | target = 0 | total_count
'potato' | 1 | 1 | 2
'tomato' | 1 | 1 | 2
'brocolli' | 1 | 0 | 1
However, I think I am using aggregation incorrectly, so I fell back on the following naive implementation:
z = {}
for i in train.index:
    fruit = train["fruit"][i]
    l = train["target"][i]
    if fruit not in z:
        # First time this fruit is seen: start its counters
        if l == 1:
            z[fruit] = {1:1,0:0,'count':1}
        else:
            z[fruit] = {1:0,0:1,'count':1}
    else:
        # Fruit already seen: bump the matching counter and the total
        if l == 1:
            z[fruit][1] += 1
        else:
            z[fruit][0] += 1
        z[fruit]['count'] += 1
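This gives a similar output, but as a dictionary rather than a DataFrame.
Can anyone enlighten me with the proper syntax for doing this the pandas way? :)
Thanks! :)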
Answer 0 (score: 4)
You need groupby + size + unstack + add_prefix + sum:
df1 = df.groupby(['Column 1','target_variable']).size() \
.unstack(fill_value=0) \
.add_prefix('target = ')
df1['total_count'] = df1.sum(axis=1)
print(df1)
target_variable  target = 0  target = 1  total_count
Column 1
brocolli                  0           1            1
potato                    1           1            2
tomato                    1           1            2
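The snippet above assumes a DataFrame named df already exists. As a minimal, self-contained sketch of the same chain, the example below rebuilds the sample data from the question; the DataFrame construction and the variable name counts are assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Sample data from the question. How df is built is an assumption;
# the original post does not show it.
df = pd.DataFrame({
    'Column 1': ['potato', 'potato', 'tomato', 'brocolli', 'tomato'],
    'target_variable': [1, 0, 1, 1, 0],
})

# Count rows per (Column 1, target_variable) pair, pivot the target
# values into columns, and fill missing combinations with 0.
counts = (
    df.groupby(['Column 1', 'target_variable'])
      .size()
      .unstack(fill_value=0)
      .add_prefix('target = ')
)

# Row-wise sum of the per-target counts gives the total per group.
counts['total_count'] = counts.sum(axis=1)
print(counts)
```

Passing fill_value=0 to unstack is what keeps brocolli, which has no rows with target 0, from ending up with NaN in that column.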
Or crosstab:
df1 = pd.crosstab(df['Column 1'],df['target_variable'], margins=True)
print(df1)
target_variable  0  1  All
Column 1
brocolli         0  1    1
potato           1  1    2
tomato           1  1    2
All              2  3    5
df1 = df1.rename(columns = {'All': 'total_count'}).iloc[:-1]
print(df1)
target_variable  0  1  total_count
Column 1
brocolli         0  1            1
potato           1  1            2
tomato           1  1            2
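Since the question mentions proportions rather than raw counts, it may also help to know that crosstab can normalize each row. The sketch below is one possible way to get per-group shares, assuming the same sample data as in the question; the DataFrame construction and the name props are illustrative assumptions.

```python
import pandas as pd

# Sample data from the question (construction assumed for illustration).
df = pd.DataFrame({
    'Column 1': ['potato', 'potato', 'tomato', 'brocolli', 'tomato'],
    'target_variable': [1, 0, 1, 1, 0],
})

# normalize='index' divides each row by its row total,
# so every row of the result sums to 1.
props = pd.crosstab(df['Column 1'], df['target_variable'], normalize='index')
print(props)
```

Here the column labels stay as the integers 0 and 1, so a single share can be read with something like props.loc['potato', 1].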
Answer 1 (score: 1)
Let's use get_dummies, add_prefix, and groupby:
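# One indicator column per target value, named 'target = 0' and 'target = 1'
df = df.assign(**df['target_variable'].astype(str).str.get_dummies().add_prefix('target = '))
# numeric_only=True skips the string column 'Column 1'; newer pandas versions no longer drop it silently
df['total_count'] = df.drop('target_variable', axis=1).sum(axis=1, numeric_only=True)
df.groupby('Column 1').sum()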
Output:
            target_variable  target = 0  target = 1  total_count
Column 1
'brocolli'                1           0           1            1
'potato'                  1           1           1            2
'tomato'                  1           1           1            2