Question

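I am developing a CTR prediction model for a Kaggle competition (link). I read the first 100,000 rows of the training set and then split them 80/20 into training and test sets: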

library(caret)  # provides createDataPartition
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]

Then I fit a GLM on the ad_train data:

ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)

But whenever I try to use the predict function to check how well it does on the ad_test set, I get this warning:

test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows

What gives? How do I test my GLM model on new data?

EDIT: It works perfectly now. I just needed to make this call instead:

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)

Answer 1

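This happens because you prefix every variable in the model formula with the name of the data frame. The formula then refers to the columns of ad_train itself, so predict cannot substitute the columns of newdata. Instead, your call should be: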

glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
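To make the fix concrete, here is a minimal sketch of the whole workflow on simulated data. The data frame below is a made-up stand-in for the Kaggle file (only the column names follow the question), the split uses base R rather than caret so the snippet has no dependencies, and the two held-out metrics are just one reasonable way to "test" the model, not the competition's scoring code.

```r
# Simulated stand-in for the click data; the values are invented for illustration.
set.seed(42)
n <- 1000
ad_data <- data.frame(
  clicks        = rbinom(n, 1, 0.2),
  C1            = sample(c(1002, 1005, 1010), n, replace = TRUE),
  site_category = sample(c("a", "b", "c"), n, replace = TRUE),
  device_type   = sample(0:1, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Base-R 80/20 split (the question uses caret::createDataPartition instead).
train_rows <- sample(n, size = 0.8 * n)
ad_train <- ad_data[train_rows, ]
ad_test  <- ad_data[-train_rows, ]

# Bare column names in the formula; the data frame is passed through `data` only.
ad_glm_model <- glm(clicks ~ C1 + site_category + device_type,
                    family = binomial(link = "logit"), data = ad_train)

# One predicted click probability per row of ad_test, with no row-count warning.
test_pred <- predict(ad_glm_model, newdata = ad_test, type = "response")
stopifnot(length(test_pred) == nrow(ad_test))

# Two simple held-out checks: accuracy at a 0.5 cutoff and log loss.
accuracy <- mean((test_pred > 0.5) == (ad_test$clicks == 1))
log_loss <- -mean(ad_test$clicks * log(test_pred) +
                  (1 - ad_test$clicks) * log(1 - test_pred))
print(c(accuracy = accuracy, log_loss = log_loss))
```

On real data, a categorical level that appears in ad_test but not in ad_train would make predict fail, so that case may need handling before scoring.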

As explained in the second link of the duplicate notice:

This is a problem of using different names in your data and your newdata, not a problem of using vectors versus data frames.

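When you fit a model with the lm function and then use predict to make predictions, predict tries to find the same names in newdata. In your first case the name x conflicts with mtcars$wt, and hence you get the warning.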
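The name lookup described in the quote can be reproduced with the built-in mtcars dataset. This is only a sketch of the mechanism, not the code from the linked question; the new_cars data frame and the variable names are chosen here for illustration.

```r
# Fit with the data frame name baked into the formula.
bad_fit <- lm(mtcars$mpg ~ mtcars$wt)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# predict() re-evaluates the expression mtcars$wt. new_cars contains no object
# named mtcars, so R resolves it to the original dataset and warns that
# 'newdata' had 3 rows but the variables found have 32 rows.
bad_pred <- predict(bad_fit, newdata = new_cars)
length(bad_pred)   # 32: one fitted value per row of mtcars, newdata is ignored

# Fit with bare column names and supply the data frame via `data`.
good_fit <- lm(mpg ~ wt, data = mtcars)

# Now predict() looks up `wt` in new_cars and finds it.
good_pred <- predict(good_fit, newdata = new_cars)
length(good_pred)  # 3: one prediction per row of new_cars
```

The same lookup rule applies to glm, which is why the formula in the question returned 80,000 values for a 20,000-row test set.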

How to test a logistic regression model in R?
