Get the number of outliers by group in Pandas

Time: 2018-02-26 17:43:55

Tags: python pandas

I'm trying to get the number of outliers by group from a Pandas data frame.

My data looks like this.

import pandas as pd

df = pd.DataFrame({'group':list('aaaabbbb'),
                   'val':[1,3,3,2,5,6,6,2],
                   'id':[1,1,2,2,2,3,3,3],
                   'mydate':['01/01/2011 01:00:00',
                             '01/01/2011 01:02:00',
                             '01/01/2011 01:05:00',
                             '01/01/2011 01:06:00',
                             '01/01/2011 03:00:00',
                             '01/01/2011 04:00:00',
                             '01/01/2011 05:00:00',
                             '01/01/2011 10:00:00']})
df

To get the number of outliers, I'm using the following function that gets the IQR.

def get_IQR():
    q1 = df["val"].quantile(0.25)
    q3 = df["val"].quantile(0.75)
    iqr = (df["val"] > q1) & (df["val"] < q3)
    return val.loc[iqr]

df[["group","val"]].agg([get_IQR])    

This doesn't work and produces the following error:

ValueError: no results

Does anyone have a better strategy for finding the number of outliers per group, such that...

group   num_outliers
a        ...
b        ...
c        ...

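3 answers:

Answer 0 (score: 2)

If you want to use an aggregation function, you need to define it differently. Pandas passes a vector (the values of one column for one group) to the function, and the function needs to return a single value. So: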

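import numpy as np

def get_num_outliers(column):
    # Quartiles of the values passed in (one group's column)
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # Count the values that fall outside [q1, q3]
    return sum((column < q1) | (column > q3))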

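Then call it like this (selecting val first, so the function is not applied to the non-numeric mydate column):

df.groupby('group')['val'].agg([get_num_outliers])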
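
For reference, here is a minimal self-contained sketch of this approach on the sample data from the question. It assumes that only the val column should be aggregated, and it uses named aggregation (available in pandas 0.25 and later) to label the result num_outliers, which is not part of the original answer. Note that the quartiles here are computed within each group, not over the whole column.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

def get_num_outliers(column):
    # column is the Series of values belonging to one group.
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # Count values strictly outside [q1, q3].
    return int(((column < q1) | (column > q3)).sum())

# Named aggregation gives the output column the requested name.
result = (df.groupby('group')['val']
            .agg(num_outliers=get_num_outliers)
            .reset_index())
print(result)
```

With per-group quartiles, each group in the sample data ends up with one value outside its own [q1, q3] range.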

Answer 1 (score: 1)

Here is one way:

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)

df['Outlier'] = ~df['val'].between(q1, q3)

df.groupby('group')['Outlier'].sum().astype(int).reset_index()

#   group  Outlier
# 0     a        1
# 1     b        2

Explanation

  • We define an Outlier column as a Boolean that is True when val falls outside the interquartile range [q1, q3]. Note that q1 and q3 are computed over the whole val column, not per group.
  • We then group by group and sum the Outlier series. This is possible because bool is a subclass of int, i.e. True == 1 and False == 0.
  • Convert to int, as the result should only be whole numbers (depending on the pandas version, the sum may come back as float).
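
The quartiles above are computed over the whole val column. If each group should instead be judged against its own quartiles, one possible variant (a sketch, not part of the original answer) uses groupby(...).transform to broadcast the per-group quantiles back onto the rows. The column name Outlier is kept from the answer; everything else is an assumption about what "outlier by group" should mean.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

grouped = df.groupby('group')['val']
# Per-group quartiles, aligned row by row with df.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))

# True where val lies outside its own group's [q1, q3] (bounds count as inside).
df['Outlier'] = (df['val'] < q1) | (df['val'] > q3)

per_group = df.groupby('group')['Outlier'].sum().astype(int).reset_index()
print(per_group)
```

On the sample data this gives one outlier per group, whereas the whole-column quartiles give 1 for a and 2 for b, so the choice of reference range matters.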

Answer 2 (score: 0)

Here is another way (based on jpp's answer):

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)

df['Outlier'] = ~df['val'].between(q1, q3)

df.groupby(['group', 'Outlier'])['id'].count()

# group  Outlier
# a      False      3
#        True       1
# b      False      2
#        True       2
# Name: id, dtype: int64

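Explanation

  • Define the 'Outlier' column as a Boolean based on whether 'val' falls outside the interquartile range, and save it as a column in the DataFrame.
  • Group by the 'group' and 'Outlier' columns. Calling count on the grouping returns the number of non-null entries in the selected column (here I chose 'id', but you could pick any column in your example, or simply call count on the groupby result without selecting a column).

The advantage of this two-column groupby is that you get every group/outlier combination for free if you want to inspect them later. To get the result in the specific format requested, subset by 'Outlier' before grouping: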

df.loc[df['Outlier']].groupby('group')['id'].count().reset_index().rename(columns={'id': 'num_outliers'})

#   group   num_outliers
# 0 a       1
# 1 b       2
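
One caveat with subsetting before grouping: a group that contains no outliers drops out of the result entirely. A possible way to keep such groups with a count of 0 is sketched below. The extra group c and its values are hypothetical, added only to illustrate the case; they are not part of the question's data.

```python
import pandas as pd

# Sample data from the question plus a hypothetical group 'c'
# whose values all fall inside the whole-column interquartile range.
df = pd.DataFrame({'group': list('aaaabbbbcc'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2, 2, 5]})

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
df['Outlier'] = ~df['val'].between(q1, q3)

# Summing the Boolean column keeps every group, including those with 0 outliers.
num_outliers = (df.groupby('group')['Outlier']
                  .sum()
                  .astype(int)
                  .reset_index(name='num_outliers'))
print(num_outliers)
```

Here group c appears with num_outliers equal to 0, while filtering with df.loc[df['Outlier']] first would leave it out of the output.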