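I'm struggling to find outliers (mainly high variation) in individuals' work timings. I'm trying to detect whether someone arrived or left outside their individual norm (8:30 am to 5 pm) or the group norm (7 am to 6 pm). I tried using the standard deviation, but the problem is,
Is there a known approach for finding outliers in work timings? I searched, but all I found was outlier detection in time series, whereas I'm looking for outliers in time of day. Any suggestions?
Note: my dataset has a PersonID and multiple swipe times per day per PersonID. I'm using Python 2.7.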
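For data shaped like this, one possible first step is to collapse the swipes into a single arrival and departure per person per day. The sketch below is only an illustration: the column names `SwipeTime`, `Day` and `Hour`, the sample rows, and the rule "first swipe is arrival, last swipe is departure" are all assumptions, not something stated in the question.

```python
import pandas as pd

# Hypothetical raw swipe log: one row per swipe (column names are assumptions).
swipes = pd.DataFrame({
    "PersonID": [1, 1, 1, 2, 2, 1, 1],
    "SwipeTime": pd.to_datetime([
        "2017-03-01 08:25", "2017-03-01 12:02", "2017-03-01 17:05",
        "2017-03-01 09:40", "2017-03-01 16:10",
        "2017-03-02 08:35", "2017-03-02 17:20",
    ]),
})

# Calendar day of each swipe, and time of day as decimal hours (8:30 -> 8.5).
swipes["Day"] = swipes["SwipeTime"].dt.normalize()
swipes["Hour"] = swipes["SwipeTime"].dt.hour + swipes["SwipeTime"].dt.minute / 60.0

# Assumption: the first swipe of a day is the arrival, the last is the departure.
daily = swipes.groupby(["PersonID", "Day"])["Hour"].agg(["min", "max"]).reset_index()
daily.columns = ["PersonID", "Day", "TimeIn", "TimeOut"]
print(daily)
```

The resulting table has one row per person per day with decimal-hour `TimeIn` and `TimeOut` columns, which is a convenient shape for threshold-based checks.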
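Answer 0 (score: 1)
If I understand correctly, you are looking for people who leave extremely early or arrive extremely late compared with both their own norm and the overall norm.
I would also suggest looking at daily hours worked, that is, the difference between each day's arrival and departure times, as a separate metric.
Below is a directional approach/suggestion for your problem, in Python 3 (sorry). It should address the issues you mentioned, but it does not add the daily hours that I think you should include.
This is the output you can expect: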
Outlier PersonIDs based on overall data
array([ 1., 4., 7., 8.])
Outlier PersonIDs based on each user's data and overall deviation
array([ 1., 3., 4., 5., 7., 8., 9.])
Here is the code:
#! /usr/bin/python3
import random
import pandas as pd
import numpy as np
import scipy.stats
import pprint
pp = pprint.PrettyPrinter(indent=4)
# Visualize:
import matplotlib.pyplot as plt
#### Create Sample Data START
# Parameters:
TimeInExpected=8.5 # 8:30am
TimeOutExpected=17 # 5pm
sig=1 # 1 hour standard deviation (np.random.normal takes a standard deviation, not a variance)
Persons=11
# Increasing the ratio between sample size and persons will make more people outliers.
SampleSize=20
Accuracy=1 # Each hour is segmented by hour tenth (6 minutes)
# Generate sample
SampleDF=pd.DataFrame([
    np.random.randint(1,Persons,size=(SampleSize)),
    np.around(np.random.normal(TimeInExpected, sig,size=(SampleSize)),Accuracy),
    np.around(np.random.normal(TimeOutExpected, sig,size=(SampleSize)),Accuracy)
]).T
SampleDF.columns = ['PersonID', 'TimeIn','TimeOut']
# Visualize
plt.hist(SampleDF['TimeIn'],rwidth=0.5,range=(0,24))
plt.hist(SampleDF['TimeOut'],rwidth=0.5,range=(0,24))
plt.xticks(np.arange(0,24, 1.0))
plt.xlabel('Hour of day')
plt.ylabel('Arrival / Departure Time Frequency')
plt.show()
#### Create Sample Data END
#### Analyze data
# Threshold distribution percentile
OutlierSensitivity=0.05 # Will catch extreme events that happen 5% of the time. - one sided! i.e. only late arrivals and early departures.
presetPercentile=scipy.stats.norm.ppf(1-OutlierSensitivity)
# Distribution feature and threshold percentile
argdictOverall={
    "ExpIn":SampleDF['TimeIn'].mode().mean().round(1)
    ,"ExpOut":SampleDF['TimeOut'].mode().mean().round(1)
    # Standard deviation (not variance), so the z-score multiple below stays in hours
    ,"sigIn":SampleDF['TimeIn'].std()
    ,"sigOut":SampleDF['TimeOut'].std()
    ,"percentile":presetPercentile
}
OutlierIn=argdictOverall['ExpIn']+argdictOverall['percentile']*argdictOverall['sigIn']
OutlierOut=argdictOverall['ExpOut']-argdictOverall['percentile']*argdictOverall['sigOut']
# Overall
# See all users with outliers - overall
Outliers=SampleDF["PersonID"].loc[(SampleDF['TimeIn']>OutlierIn) | (SampleDF['TimeOut']<OutlierOut)]
# See all observations with outliers - Overall
# pp.pprint(SampleDF.loc[(SampleDF['TimeIn']>OutlierIn) | (SampleDF['TimeOut']<OutlierOut)].sort_values(["PersonID"]))
# Sort and de-duplicate
Outliers=np.sort(np.unique(Outliers))
# Show users with overall outliers:
print("Outlier PersonIDs based on overall data")
pp.pprint(Outliers)
# For each person
for Person in SampleDF['PersonID'].unique():
    # Person specific dataset
    SampleDFCurrent=SampleDF.loc[SampleDF['PersonID']==Person]
    # Distribution feature and threshold percentile
    # (std() is NaN for a person with a single record, so that person is never flagged here)
    argdictCurrent={
        "ExpIn":SampleDFCurrent['TimeIn'].mode().mean().round(1)
        ,"ExpOut":SampleDFCurrent['TimeOut'].mode().mean().round(1)
        ,"sigIn":SampleDFCurrent['TimeIn'].std()
        ,"sigOut":SampleDFCurrent['TimeOut'].std()
        ,"percentile":presetPercentile
    }
    OutlierIn=argdictCurrent['ExpIn']+argdictCurrent['percentile']*argdictCurrent['sigIn']
    OutlierOut=argdictCurrent['ExpOut']-argdictCurrent['percentile']*argdictCurrent['sigOut']
    # Flag the person if any arrival is too late or any departure is too early
    if SampleDFCurrent['TimeIn'].max()>OutlierIn or SampleDFCurrent['TimeOut'].min()<OutlierOut:
        Outliers=np.append(Outliers,Person)
# Sort and get unique values
Outliers=np.sort(np.unique(Outliers))
# Show users flagged by either the overall or the per-person thresholds:
print("Outlier PersonIDs based on each user's data and overall deviation")
pp.pprint(Outliers)
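
The answer recommends treating daily hours worked as a separate metric but does not implement it. A minimal sketch of that idea follows, assuming per-day records in the same decimal-hour shape as `SampleDF`. The sample values, the `Hours` column name, and the choice of a one-sided mean-minus-z-times-standard-deviation cutoff are illustrative assumptions rather than part of the original code.

```python
import pandas as pd
import scipy.stats

# Hypothetical per-day records in the same shape as SampleDF (decimal hours).
df = pd.DataFrame({
    "PersonID": [1, 1, 2, 2, 3, 3, 4, 4],
    "TimeIn":  [8.5, 8.4, 8.6, 8.5, 8.3, 8.7, 8.5, 11.0],
    "TimeOut": [17.0, 17.1, 17.2, 16.9, 17.0, 17.3, 17.1, 15.0],
})

# Daily hours worked: departure time minus arrival time.
df["Hours"] = df["TimeOut"] - df["TimeIn"]

# One-sided cutoff (assumption): flag only unusually short days,
# using the same 5% sensitivity idea as the arrival/departure checks.
sensitivity = 0.05
z = scipy.stats.norm.ppf(1 - sensitivity)
cutoff = df["Hours"].mean() - z * df["Hours"].std()

short_days = df[df["Hours"] < cutoff]
print(short_days)
```

Because one very short day also inflates the standard deviation, a median-based spread estimate may be worth considering if many days are extreme; that would be a different technique from the one sketched here.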