Question

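Dear friends, I would appreciate it if someone could help me with a problem. I have a data frame with 8 variables, say (v1, v2, ..., v8). I would like to generate groups of datasets based on all possible combinations of these variables. That is, from a set of 8 variables I can produce 2^8 - 1 = 255 subsets of variables, such as {v1}, {v2}, ..., {v8}, {v1, v2}, ..., {v1, v2, v3}, ..., {v1, v2, ..., v8}. My goal is to compute a specific statistic for each of these groupings and then compare which subset gives the better statistic. My problem is how to produce these combinations. Thanks in advance.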

Answer 1

You need the combn function. It creates all combinations of a given size from the vector you provide. For instance, in your example:

names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)

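This gives you all combinations of V1-V8 taken 3 at a time (order does not matter, so these are combinations rather than permutations). The result is a matrix with 3 rows and one combination per column.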
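
combn handles one subset size per call, so covering every non-empty subset means looping over the sizes 1 to 8. The following is a minimal sketch of that idea, not part of the original answer; the names varnames and subsets are only illustrative, and the variable names are assumed to be V1 to V8.

```r
# Hypothetical variable names standing in for the columns of the data frame
varnames <- paste0("V", 1:8)

# For each subset size m = 1..8, combn(..., simplify = FALSE) returns a list
# of character vectors; unlist(recursive = FALSE) flattens one level so that
# the result is a single list with one element per subset.
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)   # number of non-empty subsets, 2^8 - 1
subsets[[1]]      # the first subset, a single variable name
```

Each element of subsets can then be used to select columns, for example yourdataframe[, subsets[[i]], drop = FALSE].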

Answer 2

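I will use data.table instead of data.frame, and I will add an extraneous variable (an id column) for robustness.

This will give you the subsetted data tables: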

library(data.table)

nn<-8L

dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
     c("id",paste0("V",1:nn)))

#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
         rep(rep(c(0,1),each=64),2),
         rep(rep(c(0,1),each=32),4),
         rep(rep(c(0,1),each=16),8),
         rep(rep(c(0,1),each=8),16),
         rep(rep(c(0,1),each=4),32),
         rep(rep(c(0,1),each=2),64),
         rep(c(0,1),128)) * 
  t(matrix(rep(1:nn),2^nn,nrow=nn))

#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})

#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})
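
The hard-coded rep() calls above only work for nn = 8. One possible way to generalize the indicator matrix, sketched here in base R, is expand.grid(). This is an assumption about what a "smarter way" could look like, not the original author's method; the names vars and grid are illustrative, and the rows come out in a different order than in the matrix x above (the first variable varies fastest).

```r
nn <- 8L
vars <- paste0("V", 1:nn)

# expand.grid() over nn copies of c(FALSE, TRUE) yields every inclusion
# pattern, one per row: 2^nn rows and nn logical columns.
grid <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Drop the all-FALSE row, which corresponds to the empty subset
grid <- grid[rowSums(grid) > 0, , drop = FALSE]

# Map each remaining row to the names of the variables it includes
incl <- lapply(seq_len(nrow(grid)), function(i) vars[grid[i, ]])

length(incl)   # 2^nn - 1 non-empty subsets
```

Because the empty subset is removed up front, later steps do not need to skip the first element.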

You said you want some statistics from each subset, in which case it may be more useful to specify the last line as:

ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})

#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
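
To compare subsets, it can help to collect the statistic for every subset in one table. The sketch below is only an illustration under stated assumptions: the data are made up (4 random variables rather than 8, to keep it small), the seed is arbitrary, and the overall mean is a placeholder for whatever statistic is actually of interest. The names df, stat and results are hypothetical.

```r
set.seed(1)  # arbitrary seed so the illustration is reproducible

# Made-up data: 100 rows, 4 variables named V1..V4
df <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))
names(df) <- paste0("V", 1:4)

# All non-empty subsets of the column names
subsets <- unlist(
  lapply(seq_along(df), function(m) combn(names(df), m, simplify = FALSE)),
  recursive = FALSE
)

# Placeholder statistic: the mean over all values in the subset's columns
stat <- function(cols) mean(as.matrix(df[, cols, drop = FALSE]))

# One row per subset: a label and the value of the statistic
results <- data.frame(
  subset = sapply(subsets, paste, collapse = ","),
  value  = sapply(subsets, stat)
)

# Subset with the largest value of the placeholder statistic
results[which.max(results$value), ]
```

Swapping in a different statistic only requires changing the body of stat; whether larger or smaller is "better" depends on the statistic chosen.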

How to make combinations of variables from a data frame in R?
