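I'm running a two-pass outlier check, with a different standard-deviation threshold on each pass. However, I use two loops and it runs very slowly. Is there a pandas "trick" to speed this step up?
Here is the code I'm using (warning: very ugly code!):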
import numpy as np
from pandas import Series

def find_outlier(point, window, n):
    # True if point is at least n standard deviations from the window mean (NaNs ignored).
    return np.abs(point - np.nanmean(window)) >= n * np.nanstd(window)

def despike(self, std1=2, std2=20, block=100, keep=0):
    res = self.values.copy()
    # First run with std1:
    for k, point in enumerate(res):
        if k <= block:
            window = res[k:k + block]
        elif k >= len(res) - block:
            window = res[k - block:k]
        else:
            window = res[k - block:k + block]
        # Drop points already masked as NaN before computing the statistics.
        window = window[~np.isnan(window)]
        if np.abs(point - window.mean()) >= std1 * window.std():
            res[k] = np.nan
    # Second run with std2:
    for k, point in enumerate(res):
        if k <= block:
            window = res[k:k + block]
        elif k >= len(res) - block:
            window = res[k - block:k]
        else:
            window = res[k - block:k + block]
        window = window[~np.isnan(window)]
        if np.abs(point - window.mean()) >= std2 * window.std():
            res[k] = np.nan
    return Series(res, index=self.index, name=self.name)
Answer 0 (score: 12)
I'm not sure what you are doing with the block part, but finding outliers in a Series should be as simple as:
In [1]: s > s.std() * 3
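where s is your Series and 3 is the number of standard deviations beyond which a value counts as an outlier. This expression returns a Series of booleans, which you can then use to index the Series: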
In [2]: s.head(10)
Out[2]:
0 1.181462
1 -0.112049
2 0.864603
3 -0.220569
4 1.985747
5 4.000000
6 -0.632631
7 -0.397940
8 0.881585
9 0.484691
Name: val
In [3]: s[s > s.std() * 3]
Out[3]:
5 4
Name: val
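Note that the expression above compares raw values against 3 standard deviations, so it flags only large positive values and implicitly assumes the Series is centered on zero. A minimal sketch of a mean-centered, two-sided variant follows; the sample data and the helper name `global_outliers` are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: standard-normal noise with one injected spike.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=200), name="val")
s.iloc[5] = 10.0


def global_outliers(series, n=3):
    """Boolean mask of points more than n standard deviations from the mean."""
    return (series - series.mean()).abs() > n * series.std()


mask = global_outliers(s)
print(s[mask])  # the flagged points, including the injected spike at position 5
```

Subtracting the mean before taking the absolute value lets the same threshold catch negative spikes as well.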
Update
To address the comment about blocks: I think you can use pd.rolling_std() in this case:
In [53]: pd.rolling_std(s, window=5).head(10)
Out[53]:
0 NaN
1 NaN
2 NaN
3 NaN
4 0.871541
5 0.925348
6 0.920313
7 0.370928
8 0.467932
9 0.391485
In [55]: abs(s) > pd.rolling_std(s, window=5) * 3
Docstring:
Unbiased moving standard deviation

Parameters
----------
arg : Series, DataFrame
window : Number of observations used for calculating statistic
min_periods : int
    Minimum number of observations in window required to have a value
freq : None or string alias / date offset object, default=None
    Frequency to conform to before computing statistic
    time_rule is a legacy alias for freq

Returns
-------
y : type of input argument
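In recent pandas releases the top-level pd.rolling_std() function is no longer available; the same statistic is computed with the Series.rolling() method. Below is a rough sketch of how the two-pass despike from the question might be vectorized with it. The function name `despike_rolling` and the sample data are hypothetical, and this is not a drop-in replacement for the loop: windows near the edges are truncated rather than shifted, and each pass computes all window statistics before masking, whereas the loop masks points as it goes.

```python
import numpy as np
import pandas as pd


def despike_rolling(series, n1=2, n2=20, block=100):
    """Two-pass despike against a centered rolling mean and standard deviation."""
    res = series.astype(float).copy()
    for n in (n1, n2):
        # The window spans roughly `block` points on each side; NaNs are skipped.
        roll = res.rolling(window=2 * block, center=True, min_periods=1)
        mean = roll.mean()
        std = roll.std(ddof=0)  # ddof=0 matches NumPy's default used in the loop
        # Strict ">" so that constant windows (std == 0) are not masked wholesale.
        outlier = (res - mean).abs() > n * std
        res[outlier] = np.nan
    return res


# Hypothetical usage: standard-normal noise with one large spike.
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(size=1000), name="val")
s.iloc[500] = 50.0
cleaned = despike_rolling(s, n1=5, n2=20, block=100)
print(cleaned.isna().sum())
```

Because the rolling statistics are computed in compiled code, this avoids the per-point Python loop, which is where most of the original running time goes.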