查找数据异常值

时间:2017-01-07 22:27:13

标签: python python-3.x pandas statistics

我试图找出标准偏差的秒数异常值。我有两个数据帧,如下所示。我试图找到的异常值与星期几的平均值相差1.5个标准偏差?当前代码低于数据帧。

DF1:

name    dateTime              Seconds
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
joe     2015-01-02 13:13:13   12345.0101

当前输出:df2

name   day   standardDev        mean           count
Joe    mon   22326.502700       40900.730647   1886
       tue   9687.486726        51166.213836   159
john   mon   10072.707891       41380.035108   883
       tue   5499.475345        26985.938776   196

预期产出:

df2

name   day   standardDev        mean           count     events
Joe    mon   22326.502700       40900.730647   1886      [2015-02-04 12:12:12, 2015-02-04 12:12:13]
       tue   9687.486726        51166.213836   159       [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john   mon   10072.707891       41380.035108   883       [2015-01-02 13:13:13, 2015-01-02 13:13:15]
       tue   5499.475345        26985.938776   196       [2015-01-02 13:13:13, 2015-01-02 13:13:18]

代码:

allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.ix[1:]
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
print(df.groupby((['TargetUserName', 'day_of_week'])).agg({'seconds': {'mean': lambda x: (x.mean()), 'std': lambda x: (np.std(x)), 'count': 'count'}}))

1 个答案:

答案 0 :(得分:1)

这是pandas docs的略微改编。我没有创建平均值和列的列。 std,但如果你想看到它,你可以很容易地添加它。

np.random.seed(1111)
df=pd.DataFrame({ 'name':     ['joe','john']*30, 
                  'dateTime': pd.date_range('1-1-2015',periods=60),
                  'Seconds':  np.random.randn(60)+5000. })

grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform( lambda x: (x-x.mean())/x.std())

df[ df['zscore'].abs() > 1.5 ]
Out[79]: 
        Seconds   dateTime  name    zscore
1   4998.927011 2015-01-02  john -1.522488
42  5001.275866 2015-02-12   joe  1.636829
58  4999.124550 2015-02-28   joe -1.624945

df.head(10)
Out[80]:
       Seconds   dateTime  name    zscore
0  4998.699990 2015-01-01   joe -0.959960
1  4998.927011 2015-01-02  john -1.522488
2  5000.790199 2015-01-03   joe  0.263690
3  4999.121735 2015-01-04  john -1.005137
4  5001.501822 2015-01-05   joe  1.132407
5  4999.976071 2015-01-06  john  0.678951
6  5000.275949 2015-01-07   joe  0.650297
7  4999.033607 2015-01-08  john -0.964222
8  4998.419685 2015-01-09   joe -1.328744
9  4999.796325 2015-01-10  john  1.224198