Question

Could this be a bug? I get different answers depending on whether I call describe() or std() on a groupby object.

import pandas as pd
import numpy as np
import random as rnd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : 1*(np.random.randn(8)>0.5),
                   'D' : np.random.randn(8)})
df.head()

df[['C','D']].groupby(['C'],as_index=False).describe()
# This line gives the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.

df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives the standard deviation of 'C' as 0, 1. I think this is wrong.

Answer 1

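This makes sense. In the second case, std is computed only for column D.

How? That is just how groupby works. You

1. select columns C and D
2. group by C
3. call GroupBy.std

At step 3 you did not specify any columns, so std is computed on the columns that are not the grouper, that is, column D.

As for why you see C with the values 0, 1: you specified as_index=False, so column C is inserted with the group values taken from the original DataFrame, which in this case are 0 and 1.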
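The effect of as_index=False can be checked on a small fixed frame. This is only a sketch: the values are made up for illustration (they are not the question's random data), and exactly which columns get aggregated may differ between pandas versions.

```python
import pandas as pd

# Made-up data for illustration; two rows per group so std is defined.
df = pd.DataFrame({'C': [0, 0, 1, 1],
                   'D': [1.0, 3.0, 2.0, 6.0]})

# Default: the group keys end up in the index of the result.
with_index = df.groupby('C').std()

# as_index=False: the group keys are written into a regular column named C.
as_column = df.groupby('C', as_index=False).std()

print(with_index)
print(as_column)
```

In the second result, column C holds the group labels, so reading it as a standard deviation is what produces the apparent 0, 1.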

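Run this and it becomes clear:

df[['C','D']].groupby(['C']).std()

          D
C          
0  0.998201
1       NaN

When you specify as_index=False, the index you see above (0, 1) is inserted as a column, so the 0, 1 in column C are group labels rather than standard deviations. Contrast this with what describe() reports: it computes std on C itself within each group, which is what you are looking for.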
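If the goal is the within-group standard deviation of C itself, one possible approach is to group on a copy of the key so that C is treated as an ordinary data column. This is a sketch under assumptions, not the answer's original code: the column name key is a hypothetical helper, and the data are made up.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({'C': [0, 0, 1, 1],
                   'D': [1.0, 3.0, 2.0, 6.0]})

# 'key' is a hypothetical helper column: a copy of C used only for grouping,
# so that C itself is aggregated like any other column.
per_group = df.assign(key=df['C']).groupby('key')[['C', 'D']].std()

print(per_group)
```

Since C is constant inside each group, its per-group standard deviation comes out as zero, matching what the question expected.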

Answer 2

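My friend mukherjees and I experimented with this some more, and we think there really is a problem with std(). At the following link you can see how we show that "std() is not the same as .apply(np.std, ddof=1)." After noticing this, we also found the following related bug report: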

https://github.com/pandas-dev/pandas/issues/10355
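A comparison of this kind can be set up as follows. This is a minimal sketch on made-up data and an ordinary (non-key) column, where the two computations would be expected to agree. It does not reproduce the behaviour described in the linked issue.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; every group has at least two rows.
df = pd.DataFrame({'C': [0, 0, 0, 1, 1],
                   'D': [1.0, 2.0, 4.0, 3.0, 7.0]})

grouped = df.groupby('C')['D']

# Built-in groupby std; pandas uses the sample standard deviation (ddof=1).
builtin = grouped.std()

# The same quantity computed by NumPy on each group's raw values.
manual = grouped.apply(lambda s: np.std(s.to_numpy(), ddof=1))

print(pd.DataFrame({'std()': builtin, 'np.std(ddof=1)': manual}))
```

Running the same comparison on the grouping column, with and without as_index=False, is the part that the question and the bug report are about.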

Answer 3

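Even with std(), you get a standard deviation of zero for C within each group. I only added a seed to your code to make it reproducible. I am not sure what the problem is: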

import pandas as pd
import numpy as np
import random as rnd

np.random.seed(1987)  # call seed(); assigning to it would overwrite the function and seed nothing
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
     'foo', 'bar', 'foo', 'foo'],
     'B' : ['one', 'one', 'two', 'three',
     'two', 'two', 'one', 'three'],
     'C' : 1*(np.random.randn(8)>0.5),
     'D' : np.random.randn(8)})
df

df[['C','D']].groupby(['C'],as_index=False).describe()

df[['C','D']].groupby(['C'],as_index=False).std()

Digging further, if you look at the source of the groupby describe, which is inherited from DataFrame.describe,

def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)

The code above shows that describe simply reports the result of std().
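That relationship can be checked directly. The sketch below uses made-up data and an ordinary (non-key) column, and compares the std column of the describe() summary with a direct std() call on the same groups.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; every group has at least two rows.
df = pd.DataFrame({'C': [0, 0, 0, 1, 1],
                   'D': [1.0, 2.0, 4.0, 3.0, 7.0]})

grouped = df.groupby('C')['D']

# describe() on a grouped Series returns one row per group, with a 'std' column.
from_describe = grouped.describe()['std']

# Direct call on the same groups.
from_std = grouped.std()

print(np.allclose(from_describe.to_numpy(), from_std.to_numpy()))
```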
