Question

I have a 162 x 152 dataset. I want to use stepwise regression with cross-validation on it to build a model and test how accurate that model is.

ID  RT (seconds)    76_TI2  114_DECC    120_Lop 212_PCD 236_X3Av
4281    38  4.086   1.2 2.322   0   0.195
4952    40  2.732   0.815   1.837   1.113   0.13
4823    41  4.049   1.153   2.117   2.354   0.094
3840    41  4.049   1.153   2.117   3.838   0.117
3665    42  4.56    1.224   2.128   2.38    0.246
3591    42  2.96    0.909   1.686   0.972   0.138

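This is part of my dataset. I want to build a model in which my Y variable is RT (seconds) and my predictors are the other 151 variables in the dataset. I was told to use the SuperLearner package, with a call like this: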

test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")

The problem is that I am still new to R. The main way I have been reading in my data and running other machine learning algorithms is the following:

mydata <- read.csv("filepathway")
fit <- lm(RT..seconds. ~ ., data = mydata)

So how do I separate the RT (seconds) column from the rest of the data so that I can pass things in as X and Y data frames? That is:

mydata <- read.csv("filepathway")
mydata$RT..seconds. = Y         #separating my Y response variable
Alltheother151variables = X     #separating all of my X predictor variables (all 151 of them)
SL.library <- c("SL.step")
test <- CV.SuperLearner(Y (i.e RT seconds column), X (all the other 151 variables that corresponds to the RT values), V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")

I hope this all makes sense. Thanks!

Answer 1

If the response variable is in the first column, you can simply use:

Y <- mydata[ ,  1 ]
X <- mydata[ , -1 ]

The first argument to [ (the row index) is empty, so all rows are kept; the second is 1 (the first column) or -1 (everything except the first column).
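
To see what the two index forms return, here is a small sketch on a made-up data frame. The name toy and its columns are invented for illustration and are not from the question. It also shows drop = FALSE, which may matter if only one predictor column is left, since [ would otherwise simplify the result to a vector.

```r
# Hypothetical toy data frame; names and values are invented for illustration.
toy <- data.frame(y = c(38, 40, 41), a = c(4.1, 2.7, 4.0), b = c(1.2, 0.8, 1.1))

Y <- toy[, 1]    # a single column is simplified to a plain numeric vector
X <- toy[, -1]   # everything except column 1; still a data frame here

# Selecting exactly one column normally drops to a vector;
# drop = FALSE keeps the data frame structure.
X_one <- toy[, 2, drop = FALSE]

str(Y)
str(X)
str(X_one)
```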

If your response variable is somewhere else, you can use the column name instead:

Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]
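
In the sample shown in the question, the first column appears to be ID and RT (seconds) the second, so the name-based form is probably the safer one there. Below is a minimal sketch of the split using a small stand-in for the CSV built from the question's sample rows. Dropping ID from the predictors is an assumption (it looks like a row label). The variable names response and not_predictors are only illustrative.

```r
# Stand-in for read.csv("filepathway"), built from sample rows in the question.
# read.csv() would turn "RT (seconds)" into RT..seconds. and "76_TI2" into X76_TI2.
mydata <- data.frame(
  ID           = c(4281, 4952, 4823, 3840, 3665, 3591),
  RT..seconds. = c(38, 40, 41, 41, 42, 42),
  X76_TI2      = c(4.086, 2.732, 4.049, 4.049, 4.56, 2.96),
  X114_DECC    = c(1.2, 0.815, 1.153, 1.153, 1.224, 0.909)
)

response <- "RT..seconds."
not_predictors <- c(response, "ID")  # assumption: ID is a label, not a predictor

Y <- mydata[, response]                                                 # numeric vector
X <- mydata[, setdiff(colnames(mydata), not_predictors), drop = FALSE]  # data frame

# Sanity checks: one response value per row, and the response is not among the predictors.
stopifnot(length(Y) == nrow(X), !(response %in% colnames(X)))

# On the full 162-row data, Y and X can then go into the call from the question
# (not run here: six rows are too few for V = 10 folds).
# library(SuperLearner)
# test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = "SL.step",
#                         verbose = TRUE, method = "method.NNLS")
```

If ID is meant to be a predictor after all, remove it from not_predictors and the rest stays the same.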

