Question

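I have a multi-level groupby that returns the number of rows in each group of my DataFrame. The counts are displayed in a new, unlabeled column. I am trying to filter for counts that are not equal to 6. I tried creating a True/False index for this, but I don't know how to get the results back from the index. I also tried combining filter with a lambda, without success.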

Here is the code, where Person, WL (wavelength), File, and Threshold are columns in my DataFrame (df_new).

df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], 
df_new['Threshold']])['RevNum'].count()

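It returns a list of counts, but that is as far as I can get. I can't figure out how to view only the records whose count is not equal to 6.

For example, near the bottom of the results there is this entry: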

656  TRW-2017-04-25_60_584  0            5

A larger sample of the results:

Person  WL   File                   Threshold
AEM     440  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        452  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        464  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        476  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        488  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
AGC     440  AGC-2018-05-25_12_440  0            6
                                    1            6
             AGC-2018-05-25_50_440  0            6
                                    1            6
        452  AGC-2018-05-25_12_440  0            6
                                    1            6
             AGC-2018-05-25_50_440  0            6
                                    1            6
        464  AGC-2018-05-25_12_440  0            6
                                    1            6
                                                ..
TRW     620  TRW-2017-04-08_60_572  0            6
                                    1            6
        632  TRW-2017-04-25_60_584  0            6
                                    1            6
        644  TRW-2017-04-08_60_572  0            6
                                    1            6
        656  TRW-2017-04-25_60_584  0            5
                                    1            6
             TRW-2017-04-25_60_656  0            6
                                    1            6

When I change the code to:

df_counts = df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], 
df_new['Threshold']])['RevNum'].count()

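it stores the result as a Series rather than a DataFrame, and I can't access the last column, which holds the values (the groupby counts).

When I try: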

df_counts_grouped = df_new.groupby([df_new['Person'], df_new['WL'], 
                    df_new['File'], df_new['Threshold']])['RevNum'].count()
df_counts_grouped.filter(lambda x: x['B'].max() != 6)

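I tried .max, .min, .count, and so on.

It says the 'function' object is not iterable. I believe a Series is not iterable? Any help filtering my groupby results is appreciated.

If I could put the groupby results into a new DataFrame and rename the resulting 'count' column, I could access it. I'm not sure how to send the groupby results with the counts to a new DataFrame. I'm also not sure how to use the results to select only the appropriate rows from the first DataFrame, since each count covers many rows of the original DataFrame.

Before any grouping, the DataFrame looks like this: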

                     File  Threshold  StepSize  RevNum   WL  RevPos  BkgdLt  Person        Date  AbRevPos  ExpNum  EarlyEnd
48  AEM-2018-05-23_11_440          1      1.50     7.0  464   -2.07      11     AEM  2018-05-23      2.07     Two       NaN
49  AEM-2018-05-23_11_440          1      0.82     8.0  464   -3.57      11     AEM  2018-05-23      3.57     Two       NaN
50  AEM-2018-05-23_11_440          1      1.50     7.0  488   -2.58      11     AEM  2018-05-23      2.58     Two       NaN
54  AEM-2018-05-23_11_440          1      0.82     8.0  488   -5.58      11     AEM  2018-05-23      5.58     Two       NaN
55  AEM-2018-05-23_11_440          1      1.50     7.0  440   -3.00      11     AEM  2018-05-23      3.00     Two       NaN

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3286 entries, 48 to 7839
Data columns (total 12 columns):
File         3286 non-null object
Threshold    3286 non-null int64
StepSize     3286 non-null float64
RevNum       3286 non-null float64
WL           3286 non-null int64
RevPos       3286 non-null float64
BkgdLt       3286 non-null int32
Person       3286 non-null object
Date         3286 non-null datetime64[ns]
AbRevPos     3286 non-null float64
ExpNum       3286 non-null object
EarlyEnd     0 non-null float64
dtypes: datetime64[ns](1), float64(5), int32(1), int64(2), object(3)
memory usage: 320.9+ KB

This code:

df_counts_grouped = df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], df_new['Threshold']])['RevNum'].count()
df_counts_grouped.head(10)

produces this output:

Person  WL   File                   Threshold
AEM     440  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        452  AEM-2018-05-23_11_440  0            6
                                    1            6
             AEM-2018-05-23_50_440  0            6
                                    1            6
        464  AEM-2018-05-23_11_440  0            6
                                    1            6
Name: RevNum, dtype: int64

I found the beginning of an answer to my question, and it is a matter of syntax. It lies in the difference between a pandas Series and a pandas DataFrame!

df_new.groupby('Person')['WL'].count() # produces Pandas Series
df_new.groupby('Person')[['WL']].count() # Produces Pandas DataFrame

Found at: https://shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

Answer 1

I created a short, complete, and verifiable example for you:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'Letter':['a', 'b']*2, 'Number':[1]*3+[2]})

In [3]: df
Out[3]: 
  Letter  Number
0      a       1
1      b       1
2      a       1
3      b       2

In [4]: df.groupby(['Letter', 'Number'])['Number'].count()
Out[4]: 
Letter  Number
a       1         2
b       1         1
        2         1
Name: Number, dtype: int64

In [5]: grouped_counts = df.groupby(['Letter', 'Number'])['Number'].count()

In [6]: type(grouped_counts)
Out[6]: pandas.core.series.Series

As you can see, the maximum count is 2, so let's filter for all groups whose count is below 2.

In [7]: grouped_counts.loc[grouped_counts<2]
Out[7]: 
Letter  Number
b       1         1
        2         1
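
The same boolean mask should carry over to a four-level grouping like the one in the question. Below is a minimal sketch, assuming a small made-up frame: the values are invented for illustration, only the column names come from the question, and a target count of 3 stands in for the 6 used there. It also shows `reset_index(name=...)` as one way to get the counts into a DataFrame with a labeled count column.

```python
import pandas as pd

# Invented toy data; only the column names mirror the question.
df_new = pd.DataFrame({
    'Person':    ['AEM'] * 5,
    'WL':        [440] * 5,
    'File':      ['f1'] * 5,
    'Threshold': [0, 0, 0, 1, 1],
    'RevNum':    [1.0, 2.0, 3.0, 1.0, 2.0],
})

keys = ['Person', 'WL', 'File', 'Threshold']
expected = 3  # stands in for the 6 in the question

# Counting one column gives a Series indexed by the four grouping keys.
counts = df_new.groupby(keys)['RevNum'].count()

# A boolean mask on the Series keeps only groups whose count differs.
odd_groups = counts[counts != expected]

# reset_index(name=...) turns the Series into a DataFrame and names the count column.
odd_df = odd_groups.reset_index(name='count')
print(odd_df)
```

Passing the column names as a list of strings is equivalent here to passing the columns themselves, as the question does, and is a little shorter.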

Answer 2

I figured it out! Changing from a Series to a DataFrame is a very simple matter of syntax!

df_new.groupby('Person')['WL'].count() # produces Pandas Series
df_new.groupby('Person')[['WL']].count() # Produces Pandas DataFrame

Found at: https://shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
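
To make the difference concrete, here is a small sketch on an invented three-row frame (the data is hypothetical; only the column names follow the question). Single brackets select one column and yield a Series, while a list of columns yields a DataFrame.

```python
import pandas as pd

# Invented toy data for illustration only.
df = pd.DataFrame({'Person': ['a', 'a', 'b'], 'WL': [440, 452, 440]})

as_series = df.groupby('Person')['WL'].count()    # single brackets
as_frame = df.groupby('Person')[['WL']].count()   # double brackets

print(type(as_series).__name__)
print(type(as_frame).__name__)

# Both can be filtered with a boolean mask; only the indexing differs.
print(as_series[as_series != 2])
print(as_frame[as_frame['WL'] != 2])
```

Note that the Series form can be filtered directly as well, so the conversion is a convenience rather than a requirement.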

My code now looks like this, and I get back only the entries where the count of reversal numbers (RevNum) is not 6.

df_counts_grouped = df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], df_new['Threshold']])[['RevNum']].count()

df_counts_grouped[df_counts_grouped['RevNum'] != 6]

A simple change from single brackets around 'RevNum':

df_counts_grouped = df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], df_new['Threshold']])['RevNum'].count()

to double brackets around my column label 'RevNum':

df_counts_grouped = df_new.groupby([df_new['Person'], df_new['WL'], df_new['File'], df_new['Threshold']])[['RevNum']].count()

fixed everything!
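
The question also asks how to pull the matching rows out of the original DataFrame. Two possible approaches are sketched below on invented data, with a target count of 3 standing in for 6. The first calls `filter` on the GroupBy object itself. The attempt in the question called it on the counted Series, where `Series.filter` selects by index label, which is likely why the lambda was rejected. The second uses `transform('count')` to build a mask aligned with the original rows.

```python
import pandas as pd

# Invented toy data; only the column names mirror the question.
df_new = pd.DataFrame({
    'Person':    ['AEM'] * 5,
    'WL':        [440] * 5,
    'File':      ['f1'] * 5,
    'Threshold': [0, 0, 0, 1, 1],
    'RevNum':    [1.0, 2.0, 3.0, 1.0, 2.0],
})

keys = ['Person', 'WL', 'File', 'Threshold']
expected = 3  # stands in for the 6 in the question

# Option 1: GroupBy.filter keeps all rows of each group for which
# the function returns True.
rows_a = df_new.groupby(keys).filter(lambda g: g['RevNum'].count() != expected)

# Option 2: transform broadcasts each group's count back onto its rows,
# which gives a boolean mask with the same index as df_new.
group_count = df_new.groupby(keys)['RevNum'].transform('count')
rows_b = df_new[group_count != expected]

print(rows_a)
print(rows_b)
```

Both options should return the same rows. The `transform` version avoids calling a Python function per group, which may matter on larger frames.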
