caret: glmnet warning - x should be a matrix with 2 or more columns

Time: 2017-10-24 12:58:31

Tags: r r-caret glmnet

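When I pass a single numeric variable as the only predictor to glmnet through caret, I get an error saying "x should be a matrix with 2 or more columns", but when I pass a single factor variable, the train function runs as expected. Adding a factor variable to the single numeric variable also works as expected. Why is this? So far it has been very problematic. I know that glmnet requires a matrix rather than a data frame, but caret should take care of that conversion, since it evidently does so for factor variables. In addition, I need to be able to run my analyses consistently within the caret framework, and I need to keep my data as a data frame. Here is an example; please ignore the warning messages caused by the small number of observations, which are unrelated to this question.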

Any help is appreciated, as this is driving me crazy!

df <- structure(list(
  Y = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                  1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L),
                .Label = c("No", "Yes"), class = "factor"),
  A = c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes",
        "No", "No", "Yes", "Yes", "N", "No", "No", "No", "No", "No"),
  B = c(30, 6, 12, 12, 12, 12, 12, 4, 12, 32,
        12, 12, 4, 24, 8, 12, 15, 6, 12, 12),
  C = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
                  2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                .Label = c("A", "B"), class = "factor")),
  .Names = c("Y", "A", "B", "C"), row.names = c(NA, 20L), class = "data.frame")



library(caret)

# set up the tuning grid
tuneGrid <- expand.grid(.alpha = seq(0, 1, 0.05), .lambda = seq(0, 2, 0.05))
# 10-fold CV
fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)

# works with a single factor variable (ignore warnings caused by the small sample size)
train(Y ~ A, data = df[c("Y", "A")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

# returns an error message when a single numeric independent variable is passed
train(Y ~ B, data = df[c("Y", "B")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

# works when a factor variable is added to the numeric variable (ignore warnings caused by the small sample size)
train(Y ~ A + B, data = df[c("Y", "A", "B")], method = "glmnet",
      family = "binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

2 answers:

Answer 0: (score: 2)

Try this trick, which pads the predictors with a constant column of ones:

df$ones <- rep(1, nrow(df))
train(Y ~ ones+B, data=df[c("Y", "B", "ones")], method="glmnet", 
    family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
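
Why the trick gets past the error can be checked without fitting anything. The following is a minimal sketch in base R. It assumes that caret builds the predictor matrix from the formula roughly the way model.matrix does and then drops the intercept column; that is an assumption about caret's internals, not something stated in the question. The data frame `toy` and its values are illustrative only.

```r
# Toy data standing in for the numeric predictor B (values are illustrative).
toy <- data.frame(B = c(30, 6, 12, 4))
toy$ones <- rep(1, nrow(toy))

# Build the design matrix for each formula and drop the intercept column
# (assumed to approximate what caret hands to glmnet).
x_single <- model.matrix(~ B, data = toy)[, -1, drop = FALSE]
x_padded <- model.matrix(~ ones + B, data = toy)[, -1, drop = FALSE]

ncol(x_single)  # one predictor column: too few for glmnet's check
ncol(x_padded)  # two predictor columns: enough for glmnet's check
```

The constant column carries no information about Y, so it only pads the matrix to two columns; the fitted effect of B should be unaffected.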

Answer 1: (score: 2)

The glmnet function performs a check near the top of the function:

np = dim(x)
if (is.null(np) | (np[2] <= 1)) 
    stop("x should be a matrix with 2 or more columns")

You can view the full source yourself by typing glmnet at the console without any parentheses.
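
The effect of that check can be reproduced in isolation. The sketch below uses a hypothetical helper, `check_x`, that mirrors the quoted lines; it is not part of glmnet. It uses `||` instead of `|` so that an input without dimensions also reaches the error message, which is a deliberate deviation from the quoted source.

```r
# Hypothetical helper mirroring the quoted check; not part of glmnet.
check_x <- function(x) {
  np <- dim(x)
  if (is.null(np) || np[2] <= 1) {
    stop("x should be a matrix with 2 or more columns")
  }
  invisible(TRUE)
}

one_col <- matrix(c(30, 6, 12, 4), ncol = 1)  # a single numeric predictor
two_col <- cbind(one_col, 1)                  # same predictor plus a constant column

# tryCatch converts the error into its message so the script keeps running.
msg <- tryCatch(check_x(one_col), error = function(e) conditionMessage(e))
msg
ok <- check_x(two_col)
```

A one-column matrix triggers the message from the question, while any matrix with two or more columns passes, regardless of what the extra column contains.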

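I think the reason it works with a single factor is that caret preprocesses your dataset and runs dummyVars on any factor columns, creating one column per factor level. This is common in modelling and machine learning, and is sometimes called one-hot encoding or binary encoding.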

A factor column with the values 'red', 'green' and 'blue' would result in three columns named 'red', 'green' and 'blue'.
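
The expansion of a factor into indicator columns can be inspected with base R's model.matrix. This is a sketch only: whether caret uses dummyVars or model.matrix internally is not confirmed here, and the exact column count depends on the encoding (a full one-hot encoding gives one column per level, while the default treatment contrasts give an intercept plus one column fewer than the number of levels). The variable `colour` is illustrative.

```r
colour <- factor(c("red", "green", "blue", "red"))

# Full one-hot encoding: no intercept, one indicator column per level.
one_hot <- model.matrix(~ colour - 1)
colnames(one_hot)

# Default treatment contrasts: intercept plus (number of levels - 1) columns.
treatment <- model.matrix(~ colour)
colnames(treatment)

# The question's A column holds "Yes", "No" and a single "N", i.e. three levels,
# so even after dropping the intercept it expands to two predictor columns.
A <- c("Yes", "No", "N", "No")
n_cols_A <- ncol(model.matrix(~ A)[, -1, drop = FALSE])
n_cols_A
```

Note that the generated column names are prefixed with the variable name (for example `colourred`). The stray "N" value in the question's data may be why the single-factor model clears the two-column check; a factor with only two levels would yield a single indicator column under treatment contrasts.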