I'm trying to get the number of outliers by group from a Pandas data frame.
My data looks like this.
import pandas as pd

df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2],
                   'id': [1, 1, 2, 2, 2, 3, 3, 3],
                   'mydate': ['01/01/2011 01:00:00',
                              '01/01/2011 01:02:00',
                              '01/01/2011 01:05:00',
                              '01/01/2011 01:06:00',
                              '01/01/2011 03:00:00',
                              '01/01/2011 04:00:00',
                              '01/01/2011 05:00:00',
                              '01/01/2011 10:00:00']})
df
To get the number of outliers, I'm using the following function that gets the IQR.
def get_IQR():
    q1 = df["val"].quantile(0.25)
    q3 = df["val"].quantile(0.75)
    iqr = (df["val"] > q1) & (df["val"] < q3)
    return val.loc[iqr]

df[["group","val"]].agg([get_IQR])
This doesn't work and produces the following error:
ValueError: no results
Does anyone have a better strategy for finding the number of outliers per group, such that...
group num_outliers
a ...
b ...
c ...
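Answer 0 (score: 2)
If you want to use an aggregation function, you need to define it differently. Pandas passes a vector (the values of one column for one group) to the function, and the function needs to return a single value. So: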
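import numpy as np

def get_num_outliers(column):
    # Quartiles of the values passed in (one group's column at a time)
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # Count the values that fall outside [q1, q3]
    return sum((column < q1) | (column > q3))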
Then call it like this:
df.groupby('group')['val'].agg([get_num_outliers])
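For reference, here is a minimal end-to-end sketch of this approach on the sample data from the question. It assumes only the 'val' column should be checked, and the output column name num_outliers is taken from the format requested in the question. Because the function is called once per group, the quartiles here are computed within each group rather than over the whole column.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns needed here)
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

def get_num_outliers(column):
    # Quartiles of the values in this group
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    # Number of values strictly below q1 or strictly above q3
    return int(((column < q1) | (column > q3)).sum())

# One row per group, with the count in a 'num_outliers' column
result = (df.groupby('group')['val']
            .agg(get_num_outliers)
            .reset_index(name='num_outliers'))
print(result)
```

On this data each group ends up with one outlier (the value 1 in group a and the value 2 in group b), which can differ from counts based on quartiles of the whole column.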
Answer 1 (score: 1)
Here is one way:
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
df['Outlier'] = ~df['val'].between(q1, q3)
df.groupby('group')['Outlier'].sum().astype(int).reset_index()
# group Outlier
# 0 a 1
# 1 b 2
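The last line of the snippet above counts outliers by summing a Boolean column. A tiny sketch of why that works, using a made-up Series named flags (a hypothetical name, not part of the original data):

```python
import pandas as pd

# Hypothetical Boolean flags: True marks an outlier
flags = pd.Series([True, False, True, True])

# In Python, bool is a subclass of int, so True counts as 1 and False as 0
assert isinstance(True, int)

# Summing the Series therefore counts the True values
n_true = int(flags.sum())
print(n_true)
```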
Explanation
Define the 'Outlier' column as a Boolean that is True when 'val' falls outside the interquartile range [q1, q3].
Group by 'group' and sum the 'Outlier' series. This is possible because bool is a subclass of int, i.e. True == 1 and False == 0.
Convert to int, as the result should only be whole numbers (the sum may otherwise be returned as float, depending on the pandas version).

Answer 2 (score: 0)
Here is another way (based on jpp's answer):
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
df['Outlier'] = ~df['val'].between(q1, q3)
df.groupby(['group', 'Outlier'])['id'].count()
# group Outlier
# a False 3
# True 1
# b False 2
# True 2
# Name: id, dtype: int64
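Explanation:
Define the 'Outlier' column as a Boolean based on whether 'val' is within the interquartile range, and save it as a column in the DataFrame.
Group by the 'group' and 'Outlier' columns, then count some non-null field in each group. In this case I chose 'id', but you could choose any column in your example, or just call count on the result of the groupby without selecting a column.
The advantage of this two-column groupby statement is that you get all group/outlier combinations for free if you want to inspect them later. To get the result in the specific format requested, subset by 'Outlier' before grouping: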
df.loc[df['Outlier']].groupby('group')['id'].count().reset_index().rename(columns={'id': 'num_outliers'})
# group num_outliers
# 0 a 1
# 1 b 2
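One caveat with subsetting before grouping: a group that has no outliers at all would not appear in the result. As a hedged alternative sketch (not part of the original answer), the two-column grouping can be reshaped into one row per group with zero-filled counts. The column names not_outlier and num_outliers below are chosen for illustration only.

```python
import pandas as pd

# Sample data from the question (only the columns needed here)
df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, 2, 5, 6, 6, 2]})

# Quartiles of the whole column, as in the answer above
q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
df['Outlier'] = ~df['val'].between(q1, q3)

# Count rows per (group, Outlier) pair, then pivot Outlier into columns;
# missing combinations are filled with 0 instead of being dropped
table = (df.groupby(['group', 'Outlier'])
           .size()
           .unstack(fill_value=0)
           .rename(columns={False: 'not_outlier', True: 'num_outliers'}))
print(table)
```

Using size() here counts rows regardless of null values, so no particular column needs to be selected.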