Question

Another update: solved (see the comments and my own answer).

Update: this is what I am trying to explain.

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

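Answer: this is explained by Bessel's correction, which uses N-1 instead of N in the denominator of the standard deviation formula. I wish pandas used the same convention as numpy.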
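To make the difference concrete, here is a small sketch that computes both versions by hand for the four prices in question, using only the standard library. The variable names are just illustrative.

```python
import math

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean, shared by both formulas.
sum_sq = sum((p - mean) ** 2 for p in prices)

# Population standard deviation: divide by N (numpy's default, ddof=0).
population_std = math.sqrt(sum_sq / n)

# Sample standard deviation with Bessel's correction: divide by N-1
# (pandas' default, ddof=1).
sample_std = math.sqrt(sum_sq / (n - 1))

print(population_std)  # matches the np.std value above
print(sample_std)      # matches the pd.Series.std value above
```

The only difference between the two results is the denominator, so the gap shrinks as the number of items per restaurant grows.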

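There is a related discussion here, but the suggestions there do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect reproduces with just one):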

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

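The problem: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviation. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure: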

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values using:

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Ignore the mean restaurant ID, of course.) Obviously, np.std(df) is not a solution once I have more than one restaurant. So I use groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What? 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

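Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe so that I get a new Series containing the standard deviation for each restaurant?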

Answer 1

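I figured it out. Pandas uses Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri pointed out in the comments,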

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
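The same ddof argument can be passed to the groupby version of std, which is probably the most direct way to get one population standard deviation per restaurant. The sketch below rebuilds the dataframe from the question; the second restaurant (ID 99999, with ids 14 and 15) is made up purely to show that grouping works across several restaurants.

```python
import numpy as np
import pandas as pd

# Rows for restaurant 10407 come from the question; restaurant 99999
# and its prices are hypothetical, added only for illustration.
df = pd.DataFrame(
    {
        "restaurant_id": [10407, 10407, 10407, 10407, 99999, 99999],
        "price": [7, 20, 22, 22, 10, 12],
    },
    index=pd.Index([1, 3, 6, 13, 14, 15], name="id"),
)

# ddof=0 divides by N, matching numpy's default convention.
per_restaurant_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(per_restaurant_std)

# Sanity check against numpy for the restaurant from the question.
print(np.isclose(per_restaurant_std.loc[10407], np.std([7, 20, 22, 22])))
```

The result is a Series indexed by restaurant_id. Going the other way, np.std accepts ddof=1 if the pandas convention is the one you want.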
