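I am trying to implement the jackknife method to compute the mean and its corresponding variance for a large data set (a few million data points). With that much data, leaving out only one element at a time is not very useful. I have code for the leave-one-out case: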
import numpy as np

def jackknife(x, func):
    """Jackknife estimate of the estimator func"""
    x = np.asarray(x)
    n = len(x)
    idx = np.arange(n)
    # Average func over the n leave-one-out subsamples
    return sum(func(x[idx != i]) for i in range(n)) / float(n)
def jackknife_var(x, func):
    """Jackknife estimate of the variance of the estimator func."""
    x = np.asarray(x)
    n = len(x)
    idx = np.arange(n)
    j_est = jackknife(x, func)
    # Scaled sum of squared deviations of the leave-one-out estimates
    return j_est, (n - 1) / (n + 0.0) * sum((func(x[idx != i]) - j_est) ** 2.0 for i in range(n))
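Given the huge data set, this is very slow. Does anyone know how to efficiently implement a jackknife that leaves out 10% of the data at a time?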
Answer 0 (score: 1)
Maybe try using sklearn.cross_validation.KFold (available as sklearn.model_selection.KFold since scikit-learn 0.18)?
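import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold
import time

def jackknife(x, func):
    """Jackknife estimate of the estimator func"""
    x = np.asarray(x)
    n = len(x)
    idx = np.arange(n)
    # Average func over the n leave-one-out subsamples
    return sum(func(x[idx != i]) for i in range(n)) / float(n)

def jackknife_v2(x, func):
    """Jackknife estimate of the estimator func, leaving out 10% of the data at a time"""
    x = np.asarray(x)
    kf = KFold(n_splits=10)
    # Each training index set omits one of the 10 contiguous folds
    return np.mean([func(x[idx]) for idx, _ in kf.split(x)])
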
x = np.random.normal(12,3, 100000)
start = time.time()
jack1 = jackknife(x, np.var)
end = time.time()
print('jackknife time elapsed: {:>10f}'.format(end-start))
start = time.time()
jack2 = jackknife_v2(x, np.var)
end = time.time()
print('jackknife_v2 time elapsed: {:>10f}'.format(end-start))
print(jack1, jack2)
## jackknife time elapsed: 59.567203
## jackknife_v2 time elapsed: 0.005295
## 8.98020789924 8.98019104673
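
The jackknife_v2 function above returns only the point estimate, while the question also asks for the variance. A minimal sketch of a delete-a-group jackknife that returns both, using NumPy only, might look like the following. The names grouped_jackknife, n_groups, and seed are hypothetical and not from the original post. The sketch assumes the data points are independent, so that shuffling them into random groups is harmless, and it uses the usual grouped-jackknife variance factor (g - 1) / g for g groups of roughly equal size.

```python
import numpy as np

def grouped_jackknife(x, func, n_groups=10, seed=0):
    """Delete-a-group jackknife: returns (estimate, variance) of func on x."""
    x = np.asarray(x)
    # Assumption: observations are i.i.d., so random grouping is valid.
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(x)), n_groups)
    mask = np.ones(len(x), dtype=bool)
    estimates = np.empty(n_groups)
    for j, g in enumerate(groups):
        mask[g] = False               # leave group j out
        estimates[j] = func(x[mask])
        mask[g] = True                # restore it for the next round
    j_est = estimates.mean()
    # Grouped-jackknife variance: (g - 1) / g * sum of squared deviations
    j_var = (n_groups - 1) / n_groups * np.sum((estimates - j_est) ** 2)
    return j_est, j_var

x = np.random.default_rng(1).normal(12, 3, 100_000)
est, var = grouped_jackknife(x, np.mean)
print(est, var)
```

With n_groups=10 each replicate omits 10% of the data, so func is evaluated only 10 times regardless of the sample size. Setting n_groups equal to len(x) reduces this to the leave-one-out jackknife from the question.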