Question

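I have a model that fits a logistic regression with cv.glmnet on the training dataset. When I predict on testdata and try to build a confusion matrix, it throws an error. Both the train and test sets are imbalanced.

Here are the dimensions of the test and training datasets. Both testdata and traindata come from one large dataset with 1234 columns and 60 rows, which I split randomly into two sets so that I can check the sensitivity and specificity of the classification at the end.

> dim(traindata)
[1]   40 1234
> dim(testdata)
[1]   20 1234

This is what I have tried.

Subtype   = factor(traindata$Subtype) 
CV=cv.glmnet(x=data.matrix(traindata),y=Subtype,standardize=TRUE,alpha=0,nfolds=3,family="multinomial")
response_predict=predict(CV, data.matrix(testdata),type="response")
predicted = as.factor(names(response_predict)[1:3][apply(response_predict[1:3], 1, which.max)])

Here it throws the error:

Error in apply(response_predict[1:3], 1, which.max) : dim(X) must have a positive length

My question is how to use cv.glmnet on such an imbalanced dataset, and how to get rid of the error above. Thanks.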

Answer 1

The imbalance has nothing to do with this error. First of all, response_predict is an array, not a matrix and not a data frame. So the last line should be

predicted <- as.factor(colnames(response_predict[, , 1])[1:3][apply(response_predict[, 1:3, 1], 1, which.max)])

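That is, since we are dealing with a three-dimensional array, we need three indices. response_predict[1:3] simply means the first three numbers of the array, not three of its columns. And since response_predict is not a data frame, names does not give you its column names.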
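To make the indexing point concrete, here is a minimal sketch that does not need glmnet at all. It builds a toy 4 x 3 x 1 array as a stand-in for the prediction output; the class names A, B, C, the probabilities, and the helper name prob_matrix are all made up for illustration.

```r
# Toy stand-in for predict(CV, newx, type = "response") on a multinomial fit:
# observations x classes x lambda values. All values here are invented.
probs <- c(0.7, 0.2, 0.1, 0.3,   # class A, rows 1 to 4
           0.2, 0.5, 0.3, 0.3,   # class B
           0.1, 0.3, 0.6, 0.4)   # class C
response_predict <- array(
  probs,
  dim = c(4, 3, 1),
  dimnames = list(NULL, c("A", "B", "C"), "1")
)

names(response_predict)      # NULL: an array has dimnames, not names
response_predict[1:3]        # a plain vector holding the first three numbers
dim(response_predict[1:3])   # NULL, so apply() has no margin to loop over

# Using all three subscripts keeps the observation x class layout.
prob_matrix <- response_predict[, , 1]
predicted <- as.factor(colnames(prob_matrix)[apply(prob_matrix, 1, which.max)])
predicted
```

Because response_predict[1:3] has no dim attribute, apply() stops with the "dim(X) must have a positive length" error from the question.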

In fact, assuming there are three possible classes, the corrected line can be written more simply as

predicted <- as.factor(colnames(response_predict)[apply(response_predict, 1, which.max)])

which is cleaner. I guess you also know that

predicted <- as.factor(predict(CV, data.matrix(testdata), type = "class"))

gives the same result as well.
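Since the goal in the question is a confusion matrix with sensitivity and specificity, here is a rough sketch of that last step in base R. The labels below are hypothetical and are not the asker's data; the names classes, actual and confusion are illustrative only. Fixing the factor levels on both vectors keeps the table square even when a rare class is never predicted, which can easily happen with a small, imbalanced test set.

```r
# Hypothetical true and predicted labels for 10 test samples.
classes   <- c("A", "B", "C")
actual    <- factor(c("A", "A", "A", "A", "A", "A", "B", "B", "C", "C"),
                    levels = classes)
predicted <- factor(c("A", "A", "A", "A", "A", "B", "B", "A", "C", "A"),
                    levels = classes)

# Rows are predicted classes, columns are true classes.
confusion <- table(Predicted = predicted, Actual = actual)

# One-vs-rest counts for each class.
tp <- diag(confusion)
fn <- colSums(confusion) - tp
fp <- rowSums(confusion) - tp
tn <- sum(confusion) - tp - fn - fp

sensitivity <- tp / (tp + fn)   # share of each true class that was recovered
specificity <- tn / (tn + fp)   # share of the other classes correctly rejected

confusion
round(rbind(sensitivity, specificity), 3)
```

With real data, actual would be the Subtype column of testdata and predicted the result of the type = "class" call above.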
