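I am trying to find outliers in the Seconds column using the standard deviation. I have two dataframes, shown below. The outliers I am looking for are values more than 1.5 standard deviations away from the mean for that day of the week. My current code is below the dataframes.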
DF1:
name dateTime Seconds
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
joe 2015-01-02 13:13:13 12345.0101
Current output: df2
name day standardDev mean count
Joe mon 22326.502700 40900.730647 1886
tue 9687.486726 51166.213836 159
john mon 10072.707891 41380.035108 883
tue 5499.475345 26985.938776 196
Expected output:
df2
name day standardDev mean count events
Joe mon 22326.502700 40900.730647 1886 [2015-02-04 12:12:12, 2015-02-04 12:12:13]
tue 9687.486726 51166.213836 159 [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john mon 10072.707891 41380.035108 883 [2015-01-02 13:13:13, 2015-01-02 13:13:15]
tue 5499.475345 26985.938776 196 [2015-01-02 13:13:13, 2015-01-02 13:13:18]
Code:
import glob
import numpy as np
import pandas as pd

allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None,
                     names=['EventTime', "IpAddress", "Hostname", "TargetUserName",
                            "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.iloc[1:]  # drop the first row (.ix has been removed from pandas)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# seconds elapsed since midnight
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# named aggregation replaces the nested-dict form of agg, which newer pandas rejects
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds']
        .agg(mean='mean', std=lambda x: np.std(x), count='count'))
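Answer 0 (score: 1)
This is a slight adaptation of an example from the pandas docs. I did not create columns for the mean and std, but you can easily add them if you want to see them.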
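import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})
# group by name and day of week (0 = Monday ... 6 = Sunday)
grp = df.groupby(['name', df.dateTime.dt.dayofweek])['Seconds']
# z-score of each value within its (name, day of week) group
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())
# rows more than 1.5 standard deviations from their group mean
df[df['zscore'].abs() > 1.5]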
Out[79]:
Seconds dateTime name zscore
1 4998.927011 2015-01-02 john -1.522488
42 5001.275866 2015-02-12 joe 1.636829
58 4999.124550 2015-02-28 joe -1.624945
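To get closer to the expected df2 from the question (one row per name and day, with an events list of outlier timestamps), the filtered rows can be grouped again and joined onto the summary statistics. This is only a sketch on the same synthetic data: the `day` helper column is my own addition, days are numeric (0 = Monday) rather than 'mon'/'tue', and groups without outliers end up with NaN in `events`.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer.
np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('2015-01-01', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})

# Helper column (an assumption, not in the original): 0 = Monday ... 6 = Sunday.
df['day'] = df['dateTime'].dt.dayofweek
grp = df.groupby(['name', 'day'])['Seconds']

# Per-group summary, using pandas' default sample standard deviation (ddof=1).
summary = grp.agg(standardDev='std', mean='mean', count='count')

# z-score of each row relative to its own (name, day) group.
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# Collect the timestamps of the outlier rows into one list per group.
outliers = df[df['zscore'].abs() > 1.5]
events = outliers.groupby(['name', 'day'])['dateTime'].apply(list).rename('events')

# Left join on the (name, day) index; groups with no outliers get NaN.
df2 = summary.join(events)
print(df2)
```

Note that the question's code uses np.std, whose default is ddof=0, while .std() in pandas defaults to ddof=1, so the standardDev values can differ slightly between the two approaches.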
df.head(10)
Out[80]:
Seconds dateTime name zscore
0 4998.699990 2015-01-01 joe -0.959960
1 4998.927011 2015-01-02 john -1.522488
2 5000.790199 2015-01-03 joe 0.263690
3 4999.121735 2015-01-04 john -1.005137
4 5001.501822 2015-01-05 joe 1.132407
5 4999.976071 2015-01-06 john 0.678951
6 5000.275949 2015-01-07 joe 0.650297
7 4999.033607 2015-01-08 john -0.964222
8 4998.419685 2015-01-09 joe -1.328744
9 4999.796325 2015-01-10 john 1.224198
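If you do want the mean and std next to each row, as mentioned at the top of this answer, one way is to broadcast them with transform as well. A minimal sketch on the same synthetic data; the column names `mean`, `std` and `is_outlier` are just illustrative choices.

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('2015-01-01', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})

grp = df.groupby(['name', df['dateTime'].dt.dayofweek])['Seconds']

# transform returns a result aligned with the original rows,
# so every row receives the statistics of its own group.
df['mean'] = grp.transform('mean')
df['std'] = grp.transform('std')

# Same z-score as before, now computed from the explicit columns.
df['zscore'] = (df['Seconds'] - df['mean']) / df['std']
df['is_outlier'] = df['zscore'].abs() > 1.5

print(df.head(10))
```

Keeping the group statistics as columns makes it easy to check by eye why a given row was flagged.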