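When I use groupby in pandas on a DataFrame with an interval-categorical column, I get different results depending on whether I pass observed=True or observed=False. In principle, I believe I should get exactly the same result, because every interval occurs in the data.
As an example, consider the following DataFrame: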
df_testing = pd.DataFrame({"a": ["good", "good", "good", "bad", "good", "good", "bad", "good",
"good", "good"],
"b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455]})
I transform column "b" so that its values are binned into intervals. I also cast column "a" to a categorical:
df_testing["a"] = df_testing["a"].astype("category")
df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)
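For reference, here is a minimal sketch of what those two lines produce. It rebuilds the same DataFrame and prints the resulting dtypes, the interval categories, and the bin each raw value falls into. The names `raw_b` and `bin_codes` are introduced here for illustration only and are not part of the original question.

```python
import pandas as pd

df_testing = pd.DataFrame({
    "a": ["good", "good", "good", "bad", "good", "good", "bad", "good", "good", "good"],
    "b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455],
})
raw_b = df_testing["b"].copy()  # keep the raw numbers for comparison (illustrative helper)

df_testing["a"] = df_testing["a"].astype("category")
df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)

# Both columns are now categorical; "b" has one category per bin.
print(df_testing.dtypes)
print(df_testing["b"].cat.categories)

# Integer code of the bin each row landed in (0 is the lowest interval).
bin_codes = df_testing["b"].cat.codes.tolist()
print(list(zip(raw_b, bin_codes)))
```

Since right=True, each bin is closed on the right, so the value 2 belongs to (0, 2] and the value 5 belongs to (2, 5]. All four intervals receive at least one row.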
With observed=False, the result is correct:
In[310]: df_testing.groupby(by="b", observed=False)["a"].value_counts()
Out[310]:
b a
(-9999.0, 0.0] good 1
(0.0, 2.0] good 3
bad 1
(2.0, 5.0] good 2
bad 1
(5.0, 1e+99] good 2
Name: a, dtype: int64
But with observed=True:
In[311]: df_testing.groupby(by="b", observed=True)["a"].value_counts()
Out[311]:
b a
(0.0, 2.0] good 1
(2.0, 5.0] good 3
bad 1
(5.0, 1e+99] good 2
bad 1
(-9999.0, 0.0] good 2
Name: a, dtype: int64
As you can see, the counts are the same, but in the second case the labels for column b are wrong!
I am using pandas v0.24.2 (the latest stable release).
Answer 0 (score: 0)
This is a bug in pandas that has been fixed in the upcoming 0.25.0 release:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+596.g20d0ad159a'
In [2]: df_testing = pd.DataFrame({"a": ["good", "good", "good", "bad", "good", "good",
...: "bad", "good", "good", "good"],
...: "b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455]})
In [3]: df_testing["a"] = df_testing["a"].astype("category")
In [4]: df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)
In [5]: df_testing.groupby(by="b", observed=False)["a"].value_counts()
Out[5]:
b a
(-9999.0, 0.0] good 1
(0.0, 2.0] good 3
bad 1
(2.0, 5.0] good 2
bad 1
(5.0, 1e+99] good 2
Name: a, dtype: int64
In [6]: df_testing.groupby(by="b", observed=True)["a"].value_counts()
Out[6]:
b a
(-9999.0, 0.0] good 1
(0.0, 2.0] good 3
bad 1
(2.0, 5.0] good 2
bad 1
(5.0, 1e+99] good 2
Name: a, dtype: int64
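If you want to confirm on your own installation whether the labels can be trusted, one possible check is to count the (interval, category) pairs row by row, without groupby, and compare that against both groupby results. This is only a sketch: the helper `nonzero_counts` and the names `expected` and `labels_ok` are made up for this example, and zero-count rows are dropped because some pandas versions list unobserved combinations.

```python
from collections import Counter

import pandas as pd

df_testing = pd.DataFrame({
    "a": ["good", "good", "good", "bad", "good", "good", "bad", "good", "good", "good"],
    "b": [1, 1, 2, 2, 3, 4, 5, 6, 11111, -5455],
})
df_testing["a"] = df_testing["a"].astype("category")
df_testing["b"] = pd.cut(df_testing["b"], [-9999, 0, 2, 5, 1e99], right=True)

# Reference counts computed row by row, with no groupby involved.
expected = Counter((str(b), a) for b, a in zip(df_testing["b"], df_testing["a"]))


def nonzero_counts(observed):
    """Return {(interval label, category): count} from groupby, skipping zero counts."""
    counts = df_testing.groupby(by="b", observed=observed)["a"].value_counts()
    return {(str(b), a): int(n) for (b, a), n in counts.items() if n > 0}


# True means the groupby labels agree with the row-by-row reference.
labels_ok = {flag: nonzero_counts(flag) == dict(expected) for flag in (False, True)}
print(labels_ok)
```

On a version with the fix, both entries should be True. Based on the output shown in the question, the observed=True entry would be expected to come out False on 0.24.2.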