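When I pass a single numeric variable as the independent variable to glmnet in caret, I get an error message stating "x should be a matrix with 2 or more columns", but when I pass a single factor variable the train function runs as expected. Adding a factor variable to the single numeric variable also works as expected. Why is this? It has been very problematic so far. I know that glmnet needs a matrix rather than a data frame, but caret should take care of that conversion, since it evidently does so for factor variables. In addition, I need to be able to implement my analyses consistently within the caret framework, and I need to keep my data as a data frame. Here is an example; please ignore the warning messages caused by having too few observations, which are unrelated to this question.
Any help is appreciated, as I am going mad!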
df <- structure(list(Y = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), A = c("Yes", "Yes", "No", "No", "No",
"No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "N",
"No", "No", "No", "No", "No"), B = c(30, 6, 12, 12, 12, 12, 12,
4, 12, 32, 12, 12, 4, 24, 8, 12, 15, 6, 12, 12), C = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Y",
"A", "B", "C"), row.names = c(NA, 20L), class = "data.frame")
library(caret)
# set up the tuning grid
tuneGrid <- expand.grid(.alpha = seq(0, 1, 0.05), .lambda = seq(0, 2, 0.05))
## 10-fold CV ##
fitControl <- trainControl(method = 'cv', number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)
#works with a single factor variable (ignore warnings based on small sample size)
train(Y ~ A, data=df[c("Y", "A")], method="glmnet",
family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
#returns an error message when a single numeric independent variable is passed
train(Y ~ B, data=df[c("Y", "B")], method="glmnet",
family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
#works when a factor variable is added to the numeric variable (ignore warnings based on small sample size)
train(Y ~ B + C, data=df[c("Y", "B", "C")], method="glmnet",
family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
Answer 0 (score: 2)
Try this trick, which adds a constant column so that x has two columns:
df$ones <- rep(1, nrow(df))
train(Y ~ ones+B, data=df[c("Y", "B", "ones")], method="glmnet",
family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
Answer 1 (score: 2)
The glmnet function performs a check near the top of its body:
np = dim(x)
if (is.null(np) | (np[2] <= 1))
stop("x should be a matrix with 2 or more columns")
You can see the full code for yourself by running glmnet without any parentheses.
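As a rough illustration of that guard, here is a minimal sketch in base R. The helper name has_enough_columns is hypothetical and is not part of glmnet; it only mirrors the condition quoted above, using `||` so that a plain vector (which has no dim) is handled cleanly.

```r
# Hypothetical helper that mirrors the guard quoted above; not part of glmnet.
has_enough_columns <- function(x) {
  np <- dim(x)
  # A vector has no dim attribute; a one-column matrix has np[2] == 1.
  !(is.null(np) || np[2] <= 1)
}

one_col <- matrix(rnorm(20), ncol = 1)   # what a single numeric predictor looks like
two_col <- cbind(one_col, ones = 1)      # same idea as adding a constant column

has_enough_columns(one_col)    # FALSE
has_enough_columns(two_col)    # TRUE
has_enough_columns(rnorm(20))  # FALSE
```

This is also why adding a constant column gets past the check: the matrix then has two columns, even though the constant carries no information.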
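I think the reason it works with a factor is that caret has already preprocessed your data set and run dummyVars on any factor columns, creating a column for each factor level. This is common in modelling and machine learning, and is sometimes called one-hot encoding or binary encoding.
A factor column with the values 'red', 'green' and 'blue' would result in three columns named 'red', 'green' and 'blue'.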
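To make the column counts concrete, here is a small sketch using base R's model.matrix as a stand-in for the encoding step. This is an assumption for illustration only: what caret does internally may differ in detail, and the names toy, colour and num are made up.

```r
# Toy data frame; `toy`, `colour`, and `num` are illustrative names only.
toy <- data.frame(
  colour = factor(c("red", "green", "blue", "red")),
  num    = c(1.5, 2.0, 3.5, 4.0)
)

# Dropping the intercept (`- 1`) keeps one indicator column per factor level.
onehot <- model.matrix(~ colour - 1, data = toy)
colnames(onehot)  # one column per level, prefixed with the variable name
ncol(onehot)

# A single numeric predictor yields a single column, which is fewer than
# the two columns that the glmnet check requires.
numeric_only <- model.matrix(~ num - 1, data = toy)
ncol(numeric_only)
```

Note that with R's default treatment contrasts and an intercept in the formula, a factor with k levels contributes k - 1 columns, so a two-level factor on its own would still yield only one column.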