Question

My example dataset looks like this:

df = data.frame(cbind(a = c(1,3,5), b = c(4,1,7), c = c(1,9,10)))
y = c(8, 9, 20)

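I want to find the combination of columns a, b and c whose sum has the strongest correlation with y.

For example, find the strongest correlation among all of these combinations: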

cor(df$a, y)
cor(df$b, y)
cor(df$c, y)
cor(df$a+df$b, y)
cor(df$a+df$c, y)
cor(df$b+df$c, y)
cor(df$a+df$b+df$c, y)

My current approach is:

library(dplyr)  # provides arrange() and mutate()

# One row per subset: NA = column excluded, 1 = column included
combination = list()
for(i in 1:3){combination[[i]]=c(NA,1)}
names(combination) = c("a", "b", "c")
combi = arrange(expand.grid(combination), a)

combi = mutate(combi, cor = NA)

for (i in 1:2^3){
  x = as.numeric(combi[i, 1:3])   # indicator columns only, not the cor column
  col = x*c(1:3)
  col = col[!is.na(col)]          # indices of the selected columns

  if(length(col)>1){
     t = rowSums(df[, col])
     combi[i, 4] = cor(t,y)
  }

  if(length(col)==1){
     t = df[, col]
     combi[i, 4] = cor(t,y)
  }

  if(length(col)==0){
     combi[i, 4] = NA
  }

}

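Is there a simpler way to evaluate all possible combinations? As the number of columns grows, enumerating every combination becomes very painful. What strategy could I use to find a good combination in a limited number of steps (a local optimum is fine)? What about forward/backward stepwise selection?

There is no model in this case. By forward/backward stepwise selection I mean something similar to what people do for regression models: instead of searching all possible column combinations at once, start with each column on its own and find the one with the strongest correlation. Then consider only combinations that include this column.

Thank you very much for any suggestions!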
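To make the forward procedure described above concrete, here is a minimal sketch. It assumes the data from the top of the question, that no candidate sum is constant (otherwise cor() returns NA), and that "strongest" means the largest signed correlation. The name forward_select is hypothetical and does not come from any package.

```r
# Example data from the question
df <- data.frame(a = c(1, 3, 5), b = c(4, 1, 7), c = c(1, 9, 10))
y <- c(8, 9, 20)

# forward_select is a hypothetical helper, not part of any package.
# Assumes no candidate sum is constant, so cor() never returns NA.
forward_select <- function(df, y) {
  selected  <- character(0)
  remaining <- names(df)
  best_cor  <- -Inf
  while (length(remaining) > 0) {
    # Correlation obtained by adding each remaining column to the current sum
    cand <- vapply(remaining, function(v) {
      cor(rowSums(df[, c(selected, v), drop = FALSE]), y)
    }, numeric(1))
    if (max(cand) <= best_cor) break   # no candidate improves the correlation
    best_cor  <- max(cand)
    pick      <- remaining[which.max(cand)]
    selected  <- c(selected, pick)
    remaining <- setdiff(remaining, pick)
  }
  list(columns = selected, cor = best_cor)
}

fit <- forward_select(df, y)
print(fit)
```

With p columns this evaluates at most p(p+1)/2 sums instead of 2^p - 1, but being greedy it can stop at a local optimum and miss the best subset.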

Answer 1

I don't know of a package that does the whole evaluation, but using combn makes the loop over all possible cases more efficient:

# basic data
df = data.frame(cbind(a = c(1,3,5), b = c(4,1,7), c = c(1,9,10)))
y = c(8, 9, 20)

# do single correlations first, since the following code with apply refuses single columns
cors<-data.frame(m=NA,cc=NA)  # define cors to collect results

for (i in 1:ncol(df)){
  cors[i,1]<-1
  cors[i,2]<-cor(df[,i],y)
}

# the following code uses combn to find all combinations and perform a function on them, with correlations as result. These are stored in cors

for (m in 2:ncol(df)){
  cv<-combn(ncol(df),m,FUN=function(x) cor(apply(df[,x],1,sum),y))
  cors[(i+1):(i+length(cv)),2]<-cv
  cors[(i+1):(i+length(cv)),1]<-m
  i<-i+length(cv)
}

print(cors)

which yields:

  m        cc
1 1 0.9011271
2 1 0.8260332
3 1 0.6444459
4 2 0.9819805
5 2 0.7317957
6 2 0.9385110
7 3 0.9299975

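Here m is the number of combined columns and cc is the correlation. With some refinement you could also keep the composition of each combination in the same data frame, but you can just as well pick the maximum first and then look up the combination that produced it (in this case the first value with m = 2, obtained with combn(ncol(df),m)[,1] for m = 2, which gives columns 1 and 2, i.e. a and b).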
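The refinement mentioned above, keeping each combination's composition next to its correlation, might look roughly like this. It is only a sketch on the question's example data; the names res, cols and best are made up for illustration.

```r
# Example data from the question
df <- data.frame(a = c(1, 3, 5), b = c(4, 1, 7), c = c(1, 9, 10))
y <- c(8, 9, 20)

# For every subset size m, list the column subsets and
# correlate the row sums of each subset with y
res <- do.call(rbind, lapply(seq_len(ncol(df)), function(m) {
  subsets <- combn(names(df), m, simplify = FALSE)
  data.frame(
    m    = m,
    cols = vapply(subsets, paste, character(1), collapse = "+"),
    cc   = vapply(subsets,
                  function(s) cor(rowSums(df[, s, drop = FALSE]), y),
                  numeric(1))
  )
}))

# Row with the largest correlation (res, cols and best are illustrative names)
best <- res[which.max(res$cc), ]
print(res)
print(best)
```

Using drop = FALSE lets rowSums handle single columns too, so the separate loop for m = 1 is not needed. which.max picks the largest signed correlation; use which.max(abs(res$cc)) if strongly negative correlations should count as well.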
