Question

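Say a DataFrame has the columns name, category and rank, where name is the individual's name, category is a categorical variable, and rank is the individual's ranking in that row.

First, I want the mean for each name and category: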

X = df.groupby(['name','category'])['rank'].agg('mean')
#out:
+---------+-------------------+------+
|  name   | category          |      |
+---------+-------------------+------+
| 1260229 |                 9 | 11.0 |
|         |                18 | 9.50 |
| 1126191 |                 5 | 4.00 |
|         |                17 | 3.00 |
|         |                23 | 4.00 |
| 1065670 |                33 | 3.00 |
|         |                39 | 5.00 |
|         |                41 | 8.00 |
+---------+-------------------+------+

Now for the standard deviation:

X.reset_index().groupby('name')['rank'].agg(np.std)
#out:
+---------+------+
|  name   |      |
+---------+------+
| 1260229 | 1.06 |
| 1126191 | 0.58 |
| 1065670 | 2.51 |
+---------+------+
#Note here that "rank" is actually the mean of rank by category. I just didn't change the name
#of the column for the new dataframe issued from X.reset_index()

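The problem is that when I compute it by hand (for individual 1260229) as np.std([11, 9.50]), it returns 0.75 instead of 1.06, and the same mismatch occurs for the other individuals.

I can't tell which step is wrong and producing these unexpected results.

pandas version: 0.23.4 Python version: 3.7.4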

Answer 1

The default ddof parameter differs: in pandas, DataFrame.std defaults to ddof=1, while numpy.std defaults to ddof=0.
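As a quick check of that difference, here is a minimal sketch using the two category means for individual 1260229 (11.0 and 9.5, copied from the question's output). The variable names are illustrative only and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Category means for individual 1260229, copied from the question's output.
values = [11.0, 9.5]

# NumPy divides by (N - ddof) and uses ddof=0 by default (population std).
population_std = float(np.std(values))

# pandas also divides by (N - ddof) but uses ddof=1 by default (sample std).
sample_std = float(pd.Series(values).std())

# Passing ddof explicitly makes the two libraries agree.
numpy_sample_std = float(np.std(values, ddof=1))
pandas_population_std = float(pd.Series(values).std(ddof=0))

print(round(population_std, 2), round(sample_std, 2))
```

The first value is the 0.75 the asker computed by hand, and the second is the 1.06 reported by the groupby call.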

You can simplify the solution by calling std directly for the second grouping step, passing the level=0 parameter:

s = X.std(level=0)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

s = X.std(level=0, ddof=1)
print (s)
name
1260229    1.060660
1126191    0.577350
1065670    2.516611
Name: rank, dtype: float64

And with ddof=0:

s = X.std(level=0, ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64

If you prefer to use groupby, that works as well:

s = X.groupby(level=0, sort=False).std(ddof=0)
print (s)
name
1260229    0.750000
1126191    0.471405
1065670    2.054805
Name: rank, dtype: float64
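
For anyone who wants to reproduce this without the original data, here is a self-contained sketch of the groupby variant. The Series X is rebuilt by hand from the means shown in the question (the original df is not available, so these rows are an assumption). It relies only on groupby, since the level argument of Series.std was deprecated in pandas 1.3 and removed in 2.0, so this form should also run on recent versions.

```python
import pandas as pd

# Rebuild X by hand from the per-(name, category) means shown in the question;
# the original DataFrame is not available, so these rows are an assumption.
index = pd.MultiIndex.from_tuples(
    [(1260229, 9), (1260229, 18),
     (1126191, 5), (1126191, 17), (1126191, 23),
     (1065670, 33), (1065670, 39), (1065670, 41)],
    names=["name", "category"],
)
X = pd.Series([11.0, 9.5, 4.0, 3.0, 4.0, 3.0, 5.0, 8.0], index=index, name="rank")

# Group on the first index level (name), keeping the original order.
grouped = X.groupby(level=0, sort=False)

sample_std = grouped.std()            # pandas default, ddof=1
population_std = grouped.std(ddof=0)  # same convention as np.std's default

print(sample_std)
print(population_std)
```

Which ddof to use depends on whether the category means are treated as a sample (ddof=1) or as the whole population (ddof=0); the key point is to use the same value in both libraries when comparing results.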

Confusing standard deviation results between a groupby object and pandas
