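I need to apply a Hampel filter to my data to strip out outliers.
I have not been able to find an existing implementation in Python; only in Matlab and R.
[Matlab function description][1]
[Stack Exchange discussion of Matlab's Hampel function][2]
[R pracma package vignette; contains the hampel function][3]
I have written the following function, modeled on the one in the R pracma package; however, it is far slower than the Matlab version. This is not ideal; I would appreciate input on how to speed it up.
The function is shown below: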

    import numpy as np

    def hampel(x, k, t0=3):
        '''adapted from hampel function in R package pracma
        x= 1-d numpy array of numbers to be filtered
        k= number of items in window/2 (# forward and backward wanted to capture in median filter)
        t0= number of standard deviations to use; 3 is default
        '''
        n = len(x)
        y = x.copy()  # y is the corrected series; copy so the input is not modified in place
        L = 1.4826  # scales the MAD so it estimates the standard deviation of normal data
        # R's 1-based (k + 1):(n - k) corresponds to 0-based range(k, n - k)
        for i in range(k, n - k):
            window = x[(i - k):(i + k + 1)]
            if np.isnan(window).all():
                continue
            x0 = np.nanmedian(window)
            S0 = L * np.nanmedian(np.abs(window - x0))
            if np.abs(x[i] - x0) > t0 * S0:
                y[i] = x0
        return y

The R implementation from the pracma package, which I used as a model:

    function (x, k, t0 = 3)
    {
        n <- length(x)
        y <- x
        ind <- c()
        L <- 1.4826
        for (i in (k + 1):(n - k)) {
            x0 <- median(x[(i - k):(i + k)])
            S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
            if (abs(x[i] - x0) > t0 * S0) {
                y[i] <- x0
                ind <- c(ind, i)
            }
        }
        list(y = y, ind = ind)
    }

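Any help making the function more efficient, or a pointer to an existing implementation in an existing Python module, would be much appreciated. Example data is below; the `%%timeit` cell magic in Jupyter indicates it currently takes 15 seconds to run: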

    vals = np.random.randn(250000)
    vals[3000] = 100
    vals[200] = -9000
    vals[-300] = 8922273

    %%timeit
    hampel(vals, k=6)

[1]: https://www.mathworks.com/help/signal/ref/hampel.html
[2]: https://dsp.stackexchange.com/questions/26552/what-is-a-hampel-filter-and-how-does-it-work
[3]: https://cran.r-project.org/web/packages/pracma/pracma.pdf
Answer 0 (score: 4)
A pandas solution is several orders of magnitude faster:

    import numpy as np

    def hampel(vals_orig, k=7, t0=3):
        '''
        vals: pandas series of values from which to remove outliers
        k: size of window (including the sample; 7 is equal to 3 on either side of value)
        '''
        # Make copy so original not edited
        vals = vals_orig.copy()
        # Hampel Filter
        L = 1.4826
        rolling_median = vals.rolling(k).median()
        difference = np.abs(rolling_median - vals)
        median_abs_deviation = difference.rolling(k).median()
        threshold = t0 * L * median_abs_deviation
        outlier_idx = difference > threshold
        vals[outlier_idx] = np.nan
        return vals

This runs in 11 ms versus 15 seconds; a huge improvement.
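For context on the constant `L = 1.4826` used in the functions above: it is approximately 1 / Φ⁻¹(0.75), the factor that makes the median absolute deviation (MAD) a consistent estimate of the standard deviation for normally distributed data. The following is only a small illustrative check; the seed, the sample size, and the value of `sigma` are arbitrary assumptions, not anything from the thread.

```python
import numpy as np
from statistics import NormalDist

# 1 / (0.75 quantile of the standard normal) is the usual MAD scale factor.
scale_factor = 1.0 / NormalDist().inv_cdf(0.75)

rng = np.random.default_rng(0)   # fixed seed, illustrative only
sigma = 2.0                      # assumed "true" standard deviation
sample = rng.normal(loc=0.0, scale=sigma, size=100_000)

# Unscaled MAD, then scaled so it is comparable to a standard deviation.
mad = np.median(np.abs(sample - np.median(sample)))
robust_sigma = scale_factor * mad

print(f"scale factor: {scale_factor:.4f}")
print(f"sample std:   {sample.std():.3f}")
print(f"scaled MAD:   {robust_sigma:.3f}")
```

This is why `t0` can be read as a "number of standard deviations": the threshold `t0 * L * MAD` behaves like `t0 * sigma` for roughly normal data, while staying insensitive to the outliers themselves.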
The pandas approach above follows a solution found for a similar filter.
Answer 1 (score: 0)
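The solution above by @EHB is helpful, but it is incorrect. Specifically, the rolling median computed in `median_abs_deviation` is taken over `difference`, which is itself the difference between each data point and the rolling median computed in `rolling_median`. It should instead be the median of the differences between the data in the rolling window and the median over that window. I modified the code above as follows: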

    import numpy as np

    def hampel(vals_orig, k=7, t0=3):
        '''
        vals: pandas series of values from which to remove outliers
        k: size of window (including the sample; 7 is equal to 3 on either side of value)
        '''
        # Make copy so original not edited
        vals = vals_orig.copy()
        # Hampel Filter
        L = 1.4826
        rolling_median = vals.rolling(window=k, center=True).median()
        MAD = lambda x: np.median(np.abs(x - np.median(x)))
        rolling_MAD = vals.rolling(window=k, center=True).apply(MAD)
        threshold = t0 * L * rolling_MAD
        difference = np.abs(vals - rolling_median)
        # Perhaps a condition should be added here in the case that the threshold value
        # is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the original values
        # being equal. See differences between MAD vs SDV.
        outlier_idx = difference > threshold
        vals[outlier_idx] = np.nan
        return vals

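As a possible alternative to `rolling(...).apply(...)`, which calls a Python function once per window, the same per-window median and MAD can be computed in one shot with NumPy's `sliding_window_view` (available from NumPy 1.20). The sketch below is not from the thread: the name `hampel_vectorized` is hypothetical, it assumes the input contains no NaNs, it replaces outliers with the window median and returns their indices (as the R pracma function does) instead of writing NaN, and it leaves the first and last `k` samples untouched. The `S0 > 0` guard is one way to act on the comment above about a zero threshold; whether that is the right policy depends on the data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def hampel_vectorized(x, k, t0=3.0):
    """Hypothetical vectorized Hampel filter (sketch, assumes no NaNs).

    x  : 1-D array of numbers
    k  : half-window length; each window covers 2*k + 1 samples
    t0 : threshold in units of the scaled MAD
    Returns (y, ind): a filtered copy of x and the indices that were replaced.
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    L = 1.4826
    if x.size < 2 * k + 1:
        return y, np.array([], dtype=int)

    # One row per full window; row j is centered on x[j + k].
    windows = sliding_window_view(x, 2 * k + 1)
    x0 = np.median(windows, axis=1)
    S0 = L * np.median(np.abs(windows - x0[:, None]), axis=1)

    center = x[k:x.size - k]
    # Assumption: a zero MAD carries no scale information, so nothing is flagged.
    is_outlier = (S0 > 0) & (np.abs(center - x0) > t0 * S0)

    ind = np.flatnonzero(is_outlier) + k
    y[ind] = x0[is_outlier]
    return y, ind
```

With the sample data from the question, a call would look like `y, ind = hampel_vectorized(vals, k=6)`. Note that `windows - x0[:, None]` materializes an array of shape `(n - 2k, 2k + 1)`, so memory use grows with the window size.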