How to calculate the margin variance in a pivot table

Time: 2019-07-31 06:55:34

Tags: pandas pivot-table

When I use aggfunc = np.var in a pivot table, some of the values become NaN. With aggfunc = np.sum they do not.

Why do aggfunc = np.var and aggfunc = np.std change the original values? I could not find an answer in the pivot table docs.

import pandas as pd
import numpy as np
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                          "bar", "bar", "bar", "bar"],
                    "B": ["one", "one", "one", "two", "two",
                          "one", "one", "two", "two"],
                    "C": ["small", "large", "large", "small",
                          "small", "large", "small", "small",
                          "large"],
                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df.pivot_table(
    index = ['A', 'B'],
    values = ['D', 'E'],
    columns = ['C'],
    aggfunc= np.sum,
    margins=True,
    margins_name = 'sum',
    dropna = False
))
print('-' * 100)
# Store the pivot under a new name so the original df stays available.
df_var = df.pivot_table(
    index = ['A', 'B'],
    values = ['D', 'E'],
    columns = ['C'],
    aggfunc= np.var,
    margins=True,
    margins_name = 'var',
    dropna = False
)
print(df_var)
            D               E          
C       large small sum large small sum
A   B                                  
bar one   4.0   5.0   9   6.0   8.0  14
    two   7.0   6.0  13   9.0   9.0  18
foo one   4.0   1.0   5   9.0   2.0  11
    two   NaN   6.0   6   NaN  11.0  11
sum      15.0  18.0  33  24.0  30.0  54
-----------------------------------------------------------------------
                D                         E                
C           large small       var     large small       var
A   B                                                      
bar one       NaN   NaN  0.500000       NaN   NaN  2.000000
    two       NaN   NaN  0.500000       NaN   NaN  0.000000
foo one  0.000000   NaN  0.333333  0.500000   NaN  2.333333
    two       NaN   0.0  0.000000       NaN   0.5  0.500000
var      5.583333   3.8  3.555556  4.666667   7.5  4.888889

More importantly, by my calculation the variance for D = large is np.var([4.0, 7.0, 4.0]) = 2.0, not 5.583333.

What I expected:

            D               E          
C       large small var large small var
A   B                                  
bar one   4.0   5.0  0.25  6.0   8.0   1.0
    two   7.0   6.0  0.25  9.0   9.0   0
foo one   4.0   1.0  2.25  9.0   2.0   12.25
    two   NaN   6.0  0     NaN   11.0  0.0
var       2.0   4.25 3.6   2.0   11.25 7.34

What does aggfunc = np.var actually compute in a pivot table?

1 answer:

Answer 0 (score: 1)

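pandas uses ddof = 1 by default, even when np.var is passed as aggfunc. np.var on its own defaults to ddof = 0 (see the np.var documentation for details).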

When a cell contains only one value, the variance with ddof = 1 is NaN, because computing it means dividing by n - 1 = 0.

The variance for D = large is np.var([2,2,4,7], ddof=1) = 5.583333333333333, so everything is correct (you have to use the individual values, not the sums).
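As a quick check of the numbers above, here is a minimal sketch that compares the two ddof settings on the individual D values for C == "large". The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Individual D values of the rows where C == "large" (taken from the example frame).
d_large = [2, 2, 4, 7]

sample_var = np.var(d_large, ddof=1)      # divides by n - 1, as the pivot table does
population_var = np.var(d_large, ddof=0)  # divides by n, NumPy's own default

print(sample_var)      # approximately 5.583333
print(population_var)  # 4.1875

# With a single value, n - 1 == 0, so the sample variance is undefined.
single = pd.Series([4.0])
print(single.var())        # NaN (pandas default ddof=1)
print(single.var(ddof=0))  # 0.0
```

The 4.1875 obtained with ddof=0 is the value that appears in the margin row once a ddof = 0 function is used.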


If you need var with ddof = 0, you can supply your own function:

def var0(x):
    return np.var(x, ddof=0)

print(df.pivot_table(
    index = ['A', 'B'],
    values = ['D', 'E'],
    columns = ['C'],
    aggfunc= var0,
    margins=True,
    margins_name = 'var',
    dropna = False
))

Result:

              D                     E                
C         large small       var large small       var
A   B                                                
bar one  0.0000  0.00  0.250000  0.00  0.00  1.000000
    two  0.0000  0.00  0.250000  0.00  0.00  0.000000
foo one  0.0000  0.00  0.222222  0.25  0.00  1.555556
    two     NaN  0.00  0.000000   NaN  0.25  0.250000
var      4.1875  3.04  3.555556  3.50  6.00  4.888889
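To see where individual cells of this table come from, the same per-cell variances can be reproduced with a plain groupby. This is only a cross-check sketch, not part of the original answer. It rebuilds the example frame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

# One group per (A, B, C) combination, i.e. one pivot-table cell.
per_cell = df.groupby(["A", "B", "C"])["E"]
pop = per_cell.var(ddof=0)  # population variance, like var0
samp = per_cell.var()       # pandas default, ddof=1

print(pop.loc[("foo", "one", "large")])   # 0.25, from the values 4 and 5
print(pop.loc[("bar", "one", "large")])   # 0.0, a single value (6)
print(samp.loc[("bar", "one", "large")])  # NaN, a single value with ddof=1
```

Cells that hold a single value give 0 with ddof = 0 and NaN with ddof = 1, which explains the difference between the two variance tables.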


Update based on the edited question
Pivot table with the sum over C in the cells, and the variance of those sums as the margin column/row.

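We first create a sum pivot table whose margin column/row is named var. We then overwrite these margin columns/rows with the var of the sums from the sum table: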

dfs = df.pivot_table(
    index = ['A', 'B'],
    values = ['D', 'E'],
    columns = ['C'],
    aggfunc= np.sum,
    margins=True,
    margins_name = 'var',
    dropna = False)

dfs[[('D','var'),('E','var')]] = df.pivot_table(
    index = ['A', 'B'],
    values = ['D', 'E'],
    columns = ['C'],
    aggfunc= np.sum,
    dropna = False).stack().groupby(level=(0,1)).apply(var0)
dfs.iloc[-1] = dfs.iloc[:-1].apply(var0)

Result:

            D                     E                  
C       large small       var large  small        var
A   B                                                
bar one   4.0  5.00  0.250000   6.0   8.00   1.000000
    two   7.0  6.00  0.250000   9.0   9.00   0.000000
foo one   4.0  1.00  2.250000   9.0   2.00  12.250000
    two   NaN  6.00  0.000000   NaN  11.00   0.000000
var       2.0  4.25  0.824219   2.0  11.25  26.792969

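In the margin row (last row), the var columns are computed as the var of the row vars. I don't understand how the OP calculated the values of these two cells. In any case, they don't seem to make much sense.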
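For reference, the two corner cells of the result above can be reproduced by hand. This is a small sketch using the row variances printed in the table. The list names are only illustrative.

```python
import numpy as np

# Values of the ('D', 'var') and ('E', 'var') columns, margin row excluded.
d_row_vars = [0.25, 0.25, 2.25, 0.0]
e_row_vars = [1.0, 0.0, 12.25, 0.0]

# np.var defaults to ddof=0, the same as var0.
d_corner = np.var(d_row_vars)
e_corner = np.var(e_row_vars)

print(d_corner)  # approximately 0.824219
print(e_corner)  # approximately 26.792969
```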