Question

I'm trying to get the number of outliers by group from a Pandas data frame.

My data looks like this.

import pandas as pd

df = pd.DataFrame({'group':list('aaaabbbb'),
                   'val':[1,3,3,2,5,6,6,2],
                   'id':[1,1,2,2,2,3,3,3],
                   'mydate':['01/01/2011 01:00:00',
                             '01/01/2011 01:02:00',
                             '01/01/2011 01:05:00',
                             '01/01/2011 01:06:00',
                             '01/01/2011 03:00:00',
                             '01/01/2011 04:00:00',
                             '01/01/2011 05:00:00',
                             '01/01/2011 10:00:00']})
df

To get the number of outliers, I'm using the following function that gets the IQR.

def get_IQR():
    q1 = df["val"].quantile(0.25)
    q3 = df["val"].quantile(0.75)
    iqr = (df["val"] > q1) & (df["val"] < q3)
    return val.loc[iqr]

df[["group","val"]].agg([get_IQR])

This doesn't work and produces the following error:

ValueError: no results

Does anyone have a better strategy for finding the number of outliers per group, such that...

group   num_outliers
a        ...
b        ...
c        ...

Answer 1

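If you want to use an aggregation function, you need to define it differently. Pandas passes a vector to the function, and the function needs to return a single value. So: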

import numpy as np

def get_num_outliers(column):
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # count the values that fall below q1 or above q3
    return sum((column < q1) | (column > q3))

Then call it like this:

# select 'val' so that non-numeric columns such as 'mydate' are not passed to the function
df.groupby('group')['val'].agg(get_num_outliers)
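
For reference, here is a minimal self-contained sketch of this approach, using the sample data from the question. Restricting the aggregation to the 'val' column and naming the result column num_outliers are assumptions made for illustration, not part of the original answer. Note that the quartiles here are computed separately inside each group.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

def get_num_outliers(column):
    # 'column' is the Series of values belonging to one group.
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # Count the values strictly below q1 or strictly above q3.
    return int(((column < q1) | (column > q3)).sum())

# One row per group; 'num_outliers' is an illustrative column name.
num_outliers = (df.groupby('group')['val']
                  .agg(get_num_outliers)
                  .reset_index(name='num_outliers'))
print(num_outliers)
```

Because each group gets its own quartiles, the counts can differ from approaches that compute the quartiles once over the whole column.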

Answer 2

Here is one way:

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)

df['Outlier'] = ~df['val'].between(q1, q3)

df.groupby('group')['Outlier'].sum().astype(int).reset_index()

#   group  Outlier
# 0     a        1
# 1     b        2

Explanation

We define an Outlier column as a Boolean that is True when val falls outside the range between the first and third quartiles. Note that q1 and q3 are computed over the whole val column, not per group.
We then group by group and sum the Outlier series. This is possible because bool is a subclass of int, i.e. True == 1 and False == 0.
We convert to int because the result should only contain whole numbers (the sum may come back as float).
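
If the quartiles should instead be computed within each group (an assumption about what the question intends, since the code above uses whole-column quartiles), one possible sketch is to broadcast the per-group quartiles back to the rows with groupby transform. The variable names below are illustrative only.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

grouped = df.groupby('group')['val']

# transform returns one value per row, aligned with df's index,
# so each row carries the quartiles of its own group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))

# Flag rows whose value lies outside their group's [q1, q3] range.
df['Outlier'] = (df['val'] < q1) | (df['val'] > q3)

per_group = df.groupby('group')['Outlier'].sum().astype(int).reset_index()
print(per_group)
```

With the sample data this flags one row in each group, which differs from the whole-column version because each group is now compared against its own quartiles.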

Answer 3

Here is another way (based on jpp's answer):

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)

df['Outlier'] = ~df['val'].between(q1, q3)

df.groupby(['group', 'Outlier'])['id'].count()

# group  Outlier
# a      False      3
#        True       1
# b      False      2
#        True       2
# Name: id, dtype: int64

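Explanation

Define the 'Outlier' column as a Boolean based on whether 'val' is within the interquartile range, and save it as a column in the DataFrame.
Group by the 'group' and 'Outlier' columns, then count some non-null field in each group (in this case I chose 'id', but you could choose any column in your example, or just call count on the result of the groupby without selecting a column).

The advantage of this two-column groupby is that you get every group/outlier combination for free if you want to inspect them later. To get the result in the specific format requested, subset by 'Outlier' before grouping: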

df.loc[df['Outlier']].groupby('group')['id'].count().reset_index().rename(columns={'id': 'num_outliers'})

#   group   num_outliers
# 0 a       1
# 1 b       2
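
To inspect all group/outlier combinations side by side, the two-column count can also be reshaped into a wide table. The following is only a sketch: using size() with unstack(fill_value=0) is a choice made here for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

# Whole-column quartiles, as in the answer above.
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
df['Outlier'] = ~df['val'].between(q1, q3)

# size() counts rows per (group, Outlier) pair; unstack moves the
# Outlier levels (False/True) into columns, filling missing pairs with 0.
wide = df.groupby(['group', 'Outlier']).size().unstack(fill_value=0)
print(wide)
```

Here the True column holds the number of outliers per group, and a group with no outliers shows 0 rather than disappearing from the result.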

Get the number of outliers by group in Pandas
