Removing outliers before aggregating in Python pandas

Time: 2016-02-18 11:24:28

Tags: python pandas

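I have a DataFrame with 5 columns. I group by the first 4 columns and compute the mean, standard deviation, and count of the 5th column.

I do this with the following code: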

df.groupby(['col1','col2','col3','col4']).agg([np.mean, np.std, len])

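My first question: I have a function that replaces outliers with the group mean. How can I drop the rows that contain those outliers instead?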

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean
    return group

df.groupby(['col1','col2','col3','col4']).transform(replace)

My second question:

When I try to chain transform and agg, I get the following error:

df.groupby(['col1','col2','col3','col4']).transform(replace).agg([np.mean, np.std, len])

AttributeError: 'DataFrame' object has no attribute 'agg'

1 answer:

Answer 0: (score: 1)

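transform() returns a DataFrame, and in your pandas version a DataFrame has no agg() method, so you need to call groupby() again on the result. Alternatively, you can keep the groupby object and reuse its grouper attribute. (DataFrame.agg() was added in pandas 0.20, but it aggregates over the whole frame rather than per group, so regrouping is still required.)

To drop the outliers, call apply() to get a boolean Series mask, use it to select the rows, and then call groupby() again: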

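import pandas as pd
import numpy as np

# Example data: four integer key columns and one normally distributed value column
N = 10000
df = pd.DataFrame(np.random.randint(0, 5, size=(N, 4)), columns=["c1", "c2", "c3", "c4"])
df["c5"] = np.random.randn(N)

def replace(group):
    # Replace values more than 2 standard deviations from the group mean
    # with that mean (the question used 3 standard deviations)
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return group.where(inliers, mean)

def drop(group):
    # Return a boolean mask that is True for the rows to keep
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return inliers

g = df.groupby(['c1','c2','c3','c4'])

# Variant 1: replace the outliers, then aggregate by reusing the grouper of g
s1 = g.c5.transform(replace)
res1 = s1.groupby(g.grouper).agg([np.mean, np.std, len])

# Variant 2: build a row mask, drop the outlier rows, then group again.
# If apply() does not return the original row index on your pandas version,
# g.c5.transform(drop) is an alternative.
mask = g.c5.apply(drop)
res2 = df[mask].groupby(['c1','c2','c3','c4']).c5.agg([np.mean, np.std, len])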
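The code above depends on pandas behaviour from 2016. On recent releases the grouper attribute is deprecated, and apply() may prepend the group keys to the index, so the mask no longer aligns with df. Below is a minimal sketch of the same two variants that avoids both, assuming the same column names and the 2-standard-deviation threshold. The helper names inlier_mask and aggregate, and the frame name demo, are made up for this illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

KEYS = ["c1", "c2", "c3", "c4"]  # grouping columns, as in the answer


def inlier_mask(df, keys=KEYS, value="c5", n_std=2):
    """Boolean Series aligned with df.index: True where a row lies within
    n_std group standard deviations of its group mean."""
    grouped = df.groupby(keys)[value]
    # transform() broadcasts each group's statistic back onto the original rows.
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    return (df[value] - mean).abs() <= n_std * std


def aggregate(values, df, keys=KEYS):
    """Group a Series of values by the key columns of df and summarise it."""
    key_columns = [df.loc[values.index, k] for k in keys]
    # "count" counts non-NaN values; it equals len here only because c5 has no NaN.
    return values.groupby(key_columns).agg(["mean", "std", "count"])


# Same shape of example data as in the answer, with a fixed seed.
rng = np.random.default_rng(0)
N = 10000
demo = pd.DataFrame(rng.integers(0, 5, size=(N, 4)), columns=KEYS)
demo["c5"] = rng.standard_normal(N)

mask = inlier_mask(demo)
group_mean = demo.groupby(KEYS)["c5"].transform("mean")

# Variant 1: outliers replaced by the group mean, all rows kept.
res_replaced = aggregate(demo["c5"].where(mask, group_mean), demo)
# Variant 2: outlier rows removed before aggregating.
res_dropped = aggregate(demo.loc[mask, "c5"], demo)

print(res_dropped.head())
```

Because the mask is built from transform(), it always shares the index of demo, so demo.loc[mask, ...] works without any realignment. Note that a group with a single row has an undefined standard deviation, so that row is treated as an outlier by this comparison.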

You can also compute the aggregates inside the callback function:

def func(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    tmp = group[inliers]
    return {"mean":tmp.mean(), "std":tmp.std(), "len":tmp.shape[0]}

g.c5.apply(func).unstack()
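To see what the callback does, here is a small sketch that runs the same func on hand-made data where the result can be checked by eye. The frame toy and its columns key and val are invented for this illustration.

```python
import pandas as pd

# Group "a" holds nine 1.0 values and one obvious outlier (100.0);
# group "b" holds five evenly spaced values with no outlier.
toy = pd.DataFrame({
    "key": ["a"] * 10 + ["b"] * 5,
    "val": [1.0] * 9 + [100.0] + [1.0, 2.0, 3.0, 4.0, 5.0],
})


def func(group):
    # Same callback as in the answer: drop values more than 2 standard
    # deviations from the group mean, then summarise what is left.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2 * std
    tmp = group[inliers]
    return {"mean": tmp.mean(), "std": tmp.std(), "len": tmp.shape[0]}


# apply() yields one value per (group, statistic) pair; unstack() turns
# the statistic names into columns.
summary = toy.groupby("key")["val"].apply(func).unstack()
print(summary)
```

In group a the mean is 10.9 and the standard deviation is about 31.3, so 100.0 lies more than 2 standard deviations away and is dropped; the group then reports a mean of 1.0 over 9 rows. Group b keeps all 5 rows. The statistics are computed before the outlier is removed, so a single extreme value inflates the standard deviation it is judged against, which is worth keeping in mind for very small groups.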