Confusing standard deviation results with groupby objects in pandas

Date: 2019-11-22 13:51:37

Tags: python pandas

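Say a DataFrame has the columns name, category, and rank, where name is a person's name, category is a categorical variable, and rank is that person's rank in a given row.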

First, I want the mean of rank for each combination of name and category:

X = df.groupby(['name','category'])['rank'].agg('mean')
#out:
+---------+-------------------+------+
|  name   | category          |      |
+---------+-------------------+------+
| 1260229 |                 9 | 11.0 |
|         |                18 | 9.50 |
| 1126191 |                 5 | 4.00 |
|         |                17 | 3.00 |
|         |                23 | 4.00 |
| 1065670 |                33 | 3.00 |
|         |                39 | 5.00 |
|         |                41 | 8.00 |
+---------+-------------------+------+

Now the standard deviation of those means for each name:

X.reset_index().groupby('name')['rank'].agg(np.std)
#out:
+---------+------+
|  name   |      |
+---------+------+
| 1260229 | 1.06 |
| 1126191 | 0.58 |
| 1065670 | 2.51 |
+---------+------+
#Note here that "rank" is actually the mean of rank by category. I just didn't change the name
#of the column for the new dataframe issued from X.reset_index()

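The problem is that when I compute it by hand (for person 1260229) as np.std([11, 9.50]), it returns 0.75 rather than 1.06, and the same discrepancy appears for the other people.

I can't see which step is wrong and produces these incorrect results.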


pandas version: 0.23.4, Python version: 3.7.4

1 Answer:

Answer 0: (score: 2)

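The two libraries use different defaults: pandas' DataFrame.std (and likewise Series.std) defaults to ddof=1, the sample standard deviation, while numpy.std defaults to ddof=0, the population standard deviation. The groupby result in the question matches the ddof=1 values, so neither result is wrong; they simply divide by N - 1 and N respectively.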
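As a quick check of that explanation, the following minimal sketch reproduces both numbers from the question using only NumPy. The variable names are illustrative; the two values are the category means for person 1260229 taken from the question's output.

```python
import numpy as np

# Category means for name 1260229, copied from the question's first table.
values = [11.0, 9.5]

# ddof=0 (NumPy's default): divide the sum of squared deviations by N.
population_std = np.std(values)

# ddof=1 (pandas' default): divide by N - 1 instead.
sample_std = np.std(values, ddof=1)

print(round(population_std, 2))  # 0.75
print(round(sample_std, 2))      # 1.06
```

With only two values the choice of divisor matters a lot, which is why the gap between 0.75 and 1.06 looks so large here.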

You can simplify the solution by replacing the second groupby with a direct call to std, passing level=0:

s = X.std(level=0)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

s = X.std(level=0, ddof=1)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

And with ddof=0, which matches NumPy's default:

s = X.std(level=0, ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64

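If you prefer groupby, that works too (it is also the form to use in newer pandas releases, where the level argument of std has been removed):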

s = X.groupby(level=0, sort=False).std(ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64
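For anyone who wants to reproduce this without the original data, here is a self-contained sketch. It assumes the Series X can be rebuilt directly from the category means shown in the question (the original DataFrame df is not available), and the names index, grouped, and comparison are illustrative only.

```python
import pandas as pd

# Rebuild X from the per-(name, category) means printed in the question.
index = pd.MultiIndex.from_tuples(
    [
        (1260229, 9), (1260229, 18),
        (1126191, 5), (1126191, 17), (1126191, 23),
        (1065670, 33), (1065670, 39), (1065670, 41),
    ],
    names=["name", "category"],
)
X = pd.Series([11.0, 9.5, 4.0, 3.0, 4.0, 3.0, 5.0, 8.0], index=index, name="rank")

# Group on the first index level; sort=False keeps the original name order.
grouped = X.groupby(level="name", sort=False)

sample_std = grouped.std()            # pandas default, ddof=1
population_std = grouped.std(ddof=0)  # same divisor as numpy.std's default

# Put both results side by side for comparison.
comparison = pd.DataFrame({"ddof=1": sample_std, "ddof=0": population_std})
print(comparison)
```

The ddof=1 column should agree with the groupby output in the question, and the ddof=0 column with the hand calculation via np.std.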