Question

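I want to do a polynomial feature expansion of a data frame in R. For example, a quadratic expansion of a df with (x1, x2, x3) should give a df with (x1, x2, x3, x1^2, x2^2, x3^2, x1x2, x1x3, x2x3). I'm currently using poly(df$x1, df$x2, df$x3, degree=2, raw=T), but that takes an unnecessary amount of typing when there are many columns. (poly(df[,1:20], degree=2, raw=T) doesn't work.) What is the best way to do this?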

Edit: I have too many columns for poly (it fails with a "vector is too large" error). I got it working with a simple for loop:

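polyexp = function(df){
  # start from the original columns, then append every second-order term
  df.polyexp = df
  new.names = colnames(df)
  for (i in seq_len(ncol(df))){
    # j starts at i, so squares are included and each pair appears once
    for (j in i:ncol(df)){
      new.names = c(new.names, paste0(names(df)[i],'.',names(df)[j]))
      df.polyexp = cbind(df.polyexp, df[,i]*df[,j])
    }
  }
  names(df.polyexp) = new.names
  return(df.polyexp)
}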

Higher-order terms can be computed by adding further nested loops.
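As a rough sketch of what "adding further loops" might look like, the function below computes only the third-order terms with three nested loops. The name cubic_terms and the dot-separated column naming are illustrative assumptions, not part of the original code, and it assumes every column is numeric.

```r
# Hypothetical helper: third-order terms only, via three nested loops.
cubic_terms <- function(df) {
  p <- ncol(df)
  terms <- list()
  for (i in seq_len(p)) {
    for (j in i:p) {
      for (k in j:p) {
        # i <= j <= k, so each product appears exactly once
        nm <- paste(names(df)[c(i, j, k)], collapse = ".")
        terms[[nm]] <- df[[i]] * df[[j]] * df[[k]]
      }
    }
  }
  as.data.frame(terms)
}

dat <- data.frame(x = 1:5, y = 1:5, z = 2:6)
cub <- cubic_terms(dat)
names(cub)
```

Collecting the products in a list and converting once at the end avoids calling cbind inside the loop, which may matter when there are many columns.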

Answer 1

You can do this with do.call:

do.call(poly, c(lapply(1:20, function(x) dat[,x]), degree=2, raw=T))

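Essentially, do.call takes the function to call (poly in your case) as its first argument and a list as its second argument. Each element of that list is then passed to the function as an argument. Here we build a list containing all the columns you want to process (I used lapply to get that list without much typing), followed by the two extra arguments you want to pass.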
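To make the mechanics concrete, here is a minimal sketch that first applies the same pattern to paste, where the result is easy to check by eye, and then to poly. The variable names (args, cols, m) and the use of unname(as.list(dat)) to get the columns are illustrative choices, not something the answer prescribes.

```r
# do.call(f, list(a, b, name = v)) is equivalent to f(a, b, name = v).
args <- c(list(1:3, 4:6), sep = "-")   # two unnamed arguments plus a named one
direct <- paste(1:3, 4:6, sep = "-")
via_do <- do.call(paste, args)
identical(direct, via_do)

# The same pattern with poly: unnamed columns first, then the named options.
dat <- data.frame(x = 1:5, y = 1:5, z = 2:6)
cols <- unname(as.list(dat))           # list of column vectors, names dropped
m <- do.call(poly, c(cols, degree = 2, raw = TRUE))
dim(m)
```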

A simple example:

dat <- data.frame(x=1:5, y=1:5, z=2:6)
do.call(poly, c(lapply(1:3, function(x) dat[,x]), degree=2, raw=T))
#      1.0.0 2.0.0 0.1.0 1.1.0 0.2.0 0.0.1 1.0.1 0.1.1 0.0.2
# [1,]     1     1     1     1     1     2     2     2     4
# [2,]     2     4     2     4     4     3     6     6     9
# [3,]     3     9     3     9     9     4    12    12    16
# [4,]     4    16     4    16    16     5    20    20    25
# [5,]     5    25     5    25    25     6    30    30    36
# attr(,"degree")
# [1] 1 2 1 2 2 1 2 2 2
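The column names in this output encode the exponent of each input, in argument order: 1.1.0 is x^1 * y^1 * z^0, that is, x*y. As a small sketch, such names could be turned into more readable labels as below; label_terms is a hypothetical helper written for illustration and is not part of poly.

```r
# Hypothetical helper: turn names like "1.1.0" into labels like "x*y".
label_terms <- function(term_names, vars) {
  sapply(strsplit(term_names, ".", fixed = TRUE), function(e) {
    e <- as.integer(e)
    # keep variables with a non-zero exponent; append ^k when k > 1
    parts <- ifelse(e > 1, paste0(vars, "^", e), vars)[e > 0]
    paste(parts, collapse = "*")
  })
}

dat <- data.frame(x = 1:5, y = 1:5, z = 2:6)
m <- do.call(poly, c(lapply(1:3, function(i) dat[, i]), degree = 2, raw = TRUE))
labels <- label_terms(colnames(m), names(dat))
labels
```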

Answer 2

A faster version:

library(data.table)

### fast version of expand.grid: all pairs (seq1[i], seq2[j]) as a two-column matrix
expgr   = function(seq1,seq2){
  cbind(rep.int(seq1, length(seq2)),
        c(t(matrix(rep.int(seq2, length(seq1)), nrow=length(seq2)))))
}

### polynomial feature expansion: returns only the second-order terms; x must be a data.table
polyexp = function(x){
  comb    = expgr(1:ncol(x),1:ncol(x))
  # keep i <= j so that squares are included and each cross product appears once
  comb    = comb[comb[,1]<=comb[,2],,drop=FALSE]
  nn      = sapply(1:nrow(comb),function(y){paste(names(x)[comb[y,1]],names(x)[comb[y,2]],sep=".")})
  res     = data.table(do.call("cbind",sapply(1:nrow(comb),function(y){x[,comb[y,1],with=F]*x[,comb[y,2],with=F]})))
  setnames(res,nn)
}
