I have been trying to create a Randomized Lasso function in R, but it does not seem to produce the same results as Python's sklearn randomized lasso function. I am applying the same philosophy here, but cannot work out where the difference comes from. The code was adapted from the following: randomized lasso function in R.
Here is the code with example data:
# generate synthetic data
set.seed(100)
size = 750
x = matrix(runif(14*size),ncol=14)
y = 10 * sin(pi*x[,1]*x[,2]) + 20*(x[,3]-0.5)^2 + 10*x[,4] + 5*x[,5] + runif(1,0,1)
nbootstrap = 200
nsteps = 20
alpha = 0.2
dimx <- dim(x)
n <- dimx[1]
p <- dimx[2]
halfsize <- as.integer(n/2)
freq <- matrix(0,1,p)
for (i in seq(nbootstrap)) {
  # Randomly reweight each variable
  xs <- t(t(x)*runif(p,alpha,1))
  # Randomly split the sample into two halves
  perm <- sample(dimx[1])
  i1 <- perm[1:halfsize]
  i2 <- perm[(halfsize+1):n]
  # Run the lasso on each half and check which variables are selected
  cv_lasso <- lars::cv.lars(xs[i1,],y[i1],plot.it=FALSE, mode = 'step')
  # first step whose CV error is within one standard error of the minimum
  idx <- which.max(cv_lasso$cv - cv_lasso$cv.error <= min(cv_lasso$cv))
  coef.lasso <- coef(lars::lars(xs[i1,],y[i1]))[idx,]
  freq <- freq + abs(sign(coef.lasso))
  cv_lasso <- lars::cv.lars(xs[i2,],y[i2],plot.it=FALSE, mode = 'step')
  idx <- which.max(cv_lasso$cv - cv_lasso$cv.error <= min(cv_lasso$cv))
  coef.lasso <- coef(lars::lars(xs[i2,],y[i2]))[idx,]
  freq <- freq + abs(sign(coef.lasso))
  print(freq)
}
# normalize frequencies to [0,1]
freq <- freq/(2*nbootstrap)
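To make the comparison with the Python table easier, the resulting frequencies can be labelled by variable and sorted, for example (a quick inspection snippet, using only the objects defined above):
# label the selection frequencies by variable and sort them for easier comparison
sel_freq <- sort(setNames(as.vector(freq), paste0("X", 1:p)), decreasing = TRUE)
print(round(sel_freq, 2))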
The results should be similar to those shown in this table: stability in python. However, neither this approach nor the original R code referenced in the first link finds the relevant features X11 to X14. I am not sure which part of my R code is not working properly.
Answer 0 (score: 3)
First of all, thank you for posting this question; I enjoyed learning about stability selection while going through your code and the references. Second, you may kick yourself when you see this answer. I think your code is valid, but you forgot to create the strongly correlated features of the "Friedman #1" regression dataset. The Python code from your second link is as follows:
#"Friedamn #1” regression problem
Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 +
     10*X[:,3] + 5*X[:,4] + np.random.normal(0,1))
#Add 3 additional correlated variables (correlated with X1-X3)
X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4))
Your code does not include this second step. As a result, every feature beyond the first few is pure noise, and the stability selection algorithm correctly excludes them.
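For completeness, the missing step could be reproduced in R along these lines (a rough sketch translated from the Python line above, reusing x and size from the question's code):
# overwrite x[,11:14] with noisy copies of x[,1:4], mirroring
# X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size, 4)) from the Python example
x[, 11:14] <- x[, 1:4] + matrix(rnorm(size * 4, mean = 0, sd = 0.025), ncol = 4)
If this is run after y is generated, as in the Python snippet, y still depends only on the original columns, while X11 to X14 become strongly correlated with X1 to X4 and should then show up in the stability selection frequencies.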