Different results in pandas groupby with observed=True / False

Date: 2019-05-28 09:28:32

Tags: python pandas pandas-groupby

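When I group a DataFrame by a column of interval categories, I get different results in pandas depending on whether I pass observed=True or observed=False to groupby. In principle, I believe I should get exactly the same results.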

As an example, assume the following DataFrame:

    import pandas as pd

    df_testing = pd.DataFrame({"a": ["good", "good", "good", "bad", "good", "good", "bad", "good",
                                     "good", "good"],
                               "b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455]})

I transform column "b" so that its values are binned into intervals. I also cast column "a" to a categorical:

    df_testing["a"] = df_testing["a"].astype("category")
    df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)
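
To see what the binning step produces, it can help to look at the categories and codes behind the binned column. The following is a minimal sketch that reuses the values and bin edges from above; the names `b` and `binned` are only illustrative.

```python
import pandas as pd

# Same values and bin edges as in the question.
b = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455])
binned = pd.cut(b, [-9999, 0, 2, 5, 1e99], right=True)

# pd.cut returns an ordered categorical whose categories are the intervals,
# sorted by their edges, regardless of the order of the rows.
print(binned.dtype)
print(binned.cat.categories)

# Each row stores only the position (code) of its interval in that order.
print(binned.cat.codes.tolist())
```

The codes make it visible that the last row (-5455) belongs to the first interval, even though it appears last in the data.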

With observed=False, the results are correct:

    In[310]: df_testing.groupby(by="b", observed=False)["a"].value_counts()

    Out[310]:
    b               a   
    (-9999.0, 0.0]  good    1
    (0.0, 2.0]      good    3
                    bad     1
    (2.0, 5.0]      good    2
                    bad     1
    (5.0, 1e+99]    good    2
    Name: a, dtype: int64

But with observed=True:

    In[311]: df_testing.groupby(by="b", observed=True)["a"].value_counts()

    Out[311]:
    b               a   
    (0.0, 2.0]      good    1
    (2.0, 5.0]      good    3
                    bad     1
    (5.0, 1e+99]    good    2
                    bad     1
    (-9999.0, 0.0]  good    2
    Name: a, dtype: int64

As you can see, the counts are the same... but in the second case the labels of column b are wrong!
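
One way to confirm which labels are right is to count the pairs without groupby and compare. In Out[311] the counts run in interval order while the labels run in order of first appearance in the data, which hints that the two were aligned against different orderings. The sketch below is only an illustration: `manual_counts` and `groupby_counts` are hypothetical helper names, each bin is identified by its left edge, and the comparison uses `size()` on both columns rather than `value_counts()`.

```python
import pandas as pd

flags = ["good", "good", "good", "bad", "good", "good", "bad", "good", "good", "good"]
values = [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455]
edges = [-9999, 0, 2, 5, 1e99]


def manual_counts(values, flags, edges):
    """Count (left edge, flag) pairs in plain Python, using right-closed bins."""
    counts = {}
    for value, flag in zip(values, flags):
        for left, right in zip(edges[:-1], edges[1:]):
            if left < value <= right:
                counts[(left, flag)] = counts.get((left, flag), 0) + 1
                break
    return counts


def groupby_counts(frame, observed):
    """Count the same pairs through groupby, dropping empty combinations."""
    sizes = frame.groupby(["b", "a"], observed=observed).size()
    return {
        (interval.left, flag): int(n)
        for (interval, flag), n in sizes.items()
        if n > 0
    }


df = pd.DataFrame({"a": flags, "b": values})
df["a"] = df["a"].astype("category")
df["b"] = pd.cut(df["b"], edges, right=True)

expected = manual_counts(values, flags, edges)
print(groupby_counts(df, observed=False) == expected)
print(groupby_counts(df, observed=True) == expected)
```

If either comparison prints False, the labels from that call do not match the data.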

I am using pandas v0.24.2 (the latest stable release).

1 Answer:

Answer 0: (score: 0)

This is a bug in pandas that has been fixed in the upcoming 0.25.0 release:

    In [1]: import pandas as pd; pd.__version__
    Out[1]: '0.25.0.dev0+596.g20d0ad159a'

    In [2]: df_testing = pd.DataFrame({"a": ["good", "good", "good", "bad", "good", "good",
       ...:                                  "bad", "good", "good", "good"],
       ...:                            "b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455]})

    In [3]: df_testing["a"] = df_testing["a"].astype("category")

    In [4]: df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)

    In [5]: df_testing.groupby(by="b", observed=False)["a"].value_counts()
    Out[5]:
    b               a
    (-9999.0, 0.0]  good    1
    (0.0, 2.0]      good    3
                    bad     1
    (2.0, 5.0]      good    2
                    bad     1
    (5.0, 1e+99]    good    2
    Name: a, dtype: int64

    In [6]: df_testing.groupby(by="b", observed=True)["a"].value_counts()
    Out[6]:
    b               a
    (-9999.0, 0.0]  good    1
    (0.0, 2.0]      good    3
                    bad     1
    (2.0, 5.0]      good    2
                    bad     1
    (5.0, 1e+99]    good    2
    Name: a, dtype: int64
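
If the same code has to run on both older and newer installations, one option is to check the installed version before relying on observed=True. This is a rough sketch, not an official pandas facility: `pandas_has_fix` is a hypothetical helper, and it assumes the version string starts with numeric major and minor parts, as in '0.24.2' or '0.25.0.dev0+596.g20d0ad159a'.

```python
import pandas as pd


def pandas_has_fix(version=pd.__version__):
    """Return True if the version is 0.25 or later (hypothetical helper)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (0, 25)


# On affected versions, fall back to observed=False, which the question
# reports as giving correctly labelled results.
observed = pandas_has_fix()
print(observed)
```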