Question

I'm facing a small pandas challenge that I'm having trouble figuring out.

I created two DataFrames with the following code:

df5 = dataFrame[['PdDistrict' , 'Category']]
df5 = df5[pd.notnull(df5['PdDistrict'])]
df5 = df5.groupby(['Category', 'PdDistrict']).size()
df5 = df5.reset_index()
df5 = df5.sort_values(['PdDistrict',0], ascending=False)

df6 = df5.groupby('PdDistrict')[0].sum()
df6 = df6.reset_index()

This gives me two DataFrames. df5 contains the number of times a particular category occurs in a particular district. For example:

'Category'   'PdDistrict'  'count'
   Drugs       Bayview       200
   Theft       Bayview       200
   Gambling    Bayview       200
   Drugs       CENTRAL       300
   Theft       CENTRAL       300
   Gambling    CENTRAL       300

df6 contains the total count across all categories for each PdDistrict, so it looks like this:

'PdDistrict' 'total count'
  Bayview        600
  CENTRAL        900

Now what I want is for df5 to look like this:

'Category'   'PdDistrict'  'count'      'Average'
   Drugs       Bayview       200           0.33
   Theft       Bayview       200           0.33
   Gambling    Bayview       200           0.33
   Drugs       CENTRAL       300           0.33
   Theft       CENTRAL       300           0.33
   Gambling    CENTRAL       300           0.33

So it basically takes the count from df5 and divides it by the total count from df6, but only within the same district. How can I do this?

res = df5.set_index('PdDistrict', append = False) / df6.set_index('PdDistrict', append = False)

The above gives NaN in the Category column.

Answer 1

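You can add the total count column to your first df and then perform the calculation. In the session below, df and df1 appear to stand in for your df5 and df6, with the columns named as in your example tables: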

In [45]:
df['total count'] = df['PdDistrict'].map(df1.set_index('PdDistrict')['total count'])
df

Out[45]:
   Category PdDistrict  count  total count
0     Drugs    Bayview    200          600
1     Theft    Bayview    200          600
2  Gambling    Bayview    200          600
3     Drugs    CENTRAL    300          900
4     Theft    CENTRAL    300          900
5  Gambling    CENTRAL    300          900

In [46]:
df['Average'] = df['count']/df['total count']
df

Out[46]:
   Category PdDistrict  count  total count   Average
0     Drugs    Bayview    200          600  0.333333
1     Theft    Bayview    200          600  0.333333
2  Gambling    Bayview    200          600  0.333333
3     Drugs    CENTRAL    300          900  0.333333
4     Theft    CENTRAL    300          900  0.333333
5  Gambling    CENTRAL    300          900  0.333333
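
For reference, the same per-district share can also be computed without a second DataFrame by using groupby with transform. This is only a minimal sketch under stated assumptions: the toy data is rebuilt from the example tables in the question, and the column names 'count' and 'Average' are taken from those tables (in the question's actual df5 the size column is named 0 after reset_index, so it would need renaming first).

```python
import pandas as pd

# Toy data rebuilt from the question's example table (column names assumed)
df = pd.DataFrame({
    "Category": ["Drugs", "Theft", "Gambling"] * 2,
    "PdDistrict": ["Bayview"] * 3 + ["CENTRAL"] * 3,
    "count": [200, 200, 200, 300, 300, 300],
})

# transform("sum") returns a Series aligned with df's rows, holding the
# district total for each row, so no separate totals frame is needed
totals = df.groupby("PdDistrict")["count"].transform("sum")

# Share of each category within its own district
df["Average"] = df["count"] / totals
print(df)
```

Because transform returns a result aligned with the original index, the division happens row by row, which avoids the index misalignment that produced NaN in the attempt from the question.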

Dividing columns in pandas using DataFrames of different sizes
