I'm facing a pandas challenge that I can't figure out.
I created two DataFrames using the following code:
df5 = dataFrame[['PdDistrict', 'Category']]
df5 = df5[pd.notnull(df5['PdDistrict'])]
df5 = df5.groupby(['Category', 'PdDistrict']).size()
df5 = df5.reset_index()
df5 = df5.sort_values(['PdDistrict',0], ascending=False)
df6 = df5.groupby('PdDistrict')[0].sum()
df6 = df6.reset_index()
This gives me two DataFrames. df5 contains the number of times a particular category occurs in a particular district. For example:
'Category' 'PdDistrict' 'count'
Drugs Bayview 200
Theft Bayview 200
Gambling Bayview 200
Drugs CENTRAL 300
Theft CENTRAL 300
Gambling CENTRAL 300
df6 contains the total count across all categories for a given PdDistrict.
So df6 looks like this:
'PdDistrict' 'total count'
Bayview 600
CENTRAL 900
Now what I want is for df5 to look like this:
'Category' 'PdDistrict' 'count' 'Average'
Drugs Bayview 200 0.33
Theft Bayview 200 0.33
Gambling Bayview 200 0.33
Drugs CENTRAL 300 0.33
Theft CENTRAL 300 0.33
Gambling CENTRAL 300 0.33
So it basically takes the count from df5 and divides it by the total count from df6, but only for the same district. How can I do this? I tried:
res = df5.set_index('PdDistrict', append = False) / df6.set_index('PdDistrict', append = False)
The above gives NaN for the categories.
Answer 0 (score: 2)
You can add the total count column to your first df (called df below, with the totals frame called df1), and then perform the calculation:
In [45]:
df['total count'] = df['PdDistrict'].map(df1.set_index('PdDistrict')['total count'])
df
Out[45]:
Category PdDistrict count total count
0 Drugs Bayview 200 600
1 Theft Bayview 200 600
2 Gambling Bayview 200 600
3 Drugs CENTRAL 300 900
4 Theft CENTRAL 300 900
5 Gambling CENTRAL 300 900
In [46]:
df['Average'] = df['count']/df['total count']
df
Out[46]:
Category PdDistrict count total count Average
0 Drugs Bayview 200 600 0.333333
1 Theft Bayview 200 600 0.333333
2 Gambling Bayview 200 600 0.333333
3 Drugs CENTRAL 300 900 0.333333
4 Theft CENTRAL 300 900 0.333333
5 Gambling CENTRAL 300 900 0.333333
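
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand, which is an assumption: the real df5 produced by the question's code names its count column 0 rather than 'count'. It then applies the same map-and-divide steps as above. The last lines show one possible alternative, a groupby with transform('sum'), which is not part of the original answer and is included only as a suggestion for computing the per-district total without df6.

```python
import pandas as pd

# Sample data copied from the question; built by hand for illustration.
df5 = pd.DataFrame({
    "Category": ["Drugs", "Theft", "Gambling"] * 2,
    "PdDistrict": ["Bayview"] * 3 + ["CENTRAL"] * 3,
    "count": [200, 200, 200, 300, 300, 300],
})
df6 = pd.DataFrame({
    "PdDistrict": ["Bayview", "CENTRAL"],
    "total count": [600, 900],
})

# Approach from the answer: look up each row's district total, then divide.
totals = df6.set_index("PdDistrict")["total count"]
df5["total count"] = df5["PdDistrict"].map(totals)
df5["Average"] = df5["count"] / df5["total count"]

# Possible alternative (not from the answer): compute the per-district sum
# directly on df5 and broadcast it back to every row, so df6 is not needed.
share = df5["count"] / df5.groupby("PdDistrict")["count"].transform("sum")

print(df5)
```

Both routes divide each row by the total of its own district, so they should agree on this data. The map version is handy when the totals already live in a separate frame; the transform version keeps everything in one frame.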