Question

I have a training dataset with the following variables:

str(PairsTrain)

'data.frame':   1495698 obs. of  4 variables:  
            $ itemID_1        : int  1 4 8 12 15 19 20 20 22 26 ...  
            $ itemID_2        : int  4112648 1223296 2161930 5637025  113701         
            $ isDuplicate     : int  1 0 1 0 0 0 0 0 1 0 ...  
            $ generationMethod: int  1 1 1 1 1 1 1 1 1 1 ...

I fit a logistic regression to this dataset using the glm() function:

mod1 <- glm(isDuplicate ~., data = PairsTrain, family = binomial)

Here is the structure of my test dataset:

str(Test)

'data.frame':   1044196 obs. of  3 variables:  
         $ id      : int  0 1 2 3 4 5 6 7 8 9 ...  
         $ itemID_1: int  5 5 6 11 23 23 30 31 36 47 ...  
         $ itemID_2: int  4670875 787210 1705280 3020777 5316130 3394969 2922567

I am trying to make predictions on my test dataset as follows:

PredTest <- predict(mod1, newdata = Test, type = "response")

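Error in eval(expr, envir, enclos) : object 'generationMethod' not found

I get the error above. I think the cause is that the number of features in my test dataset does not match the training dataset.

I am not sure whether I am right. I am stuck here and do not know how to handle this situation.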

Answer 1

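OK, this is all you need:

Test$generationMethod <- 0

The variable generationMethod must be present in Test! It was used to build the model, so predict needs it when making predictions. As you said, Test does not have this variable, so use the line above to create it in Test (R is case-sensitive, so the name has to be Test, not test). With the column set to 0, the generationMethod term contributes nothing to the linear predictor, so the predictions are effectively made as if generationMethod were 0. Its main purpose is to get you past the variable check in predict.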
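To see which predictors a fitted model expects but the new data lacks, something like the sketch below may help. It uses small made-up data frames (train, test) and a model named fit as stand-ins for PairsTrain, Test and mod1. Those names and values are assumptions for illustration only.

```r
# Toy stand-ins for PairsTrain / Test; all values are made up.
set.seed(1)
train <- data.frame(
  itemID_1         = sample(1:100, 200, replace = TRUE),
  itemID_2         = sample(1:100, 200, replace = TRUE),
  generationMethod = sample(1:3, 200, replace = TRUE)
)
train$isDuplicate <- rbinom(200, 1, 0.4)

test <- data.frame(
  id       = 0:9,
  itemID_1 = sample(1:100, 10),
  itemID_2 = sample(1:100, 10)
)

# Same formula style as in the question: every other column is a predictor.
fit <- glm(isDuplicate ~ ., data = train, family = binomial)

# Predictor names the model needs, with the response removed.
needed <- all.vars(delete.response(terms(fit)))

# Predictors that predict() will look for but not find in the new data.
missing_vars <- setdiff(needed, names(test))
print(missing_vars)
```

Any name reported here has to be either added to the new data or dropped from the model before predict will run.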

Alternatively, you could consider dropping the variable generationMethod from the model altogether. Try:

mod2 <- glm(isDuplicate ~ itemID_1 + itemID_2, data = PairsTrain,
            family = binomial)
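As a rough check that a model fitted without generationMethod can predict on data that lacks that column, here is a minimal sketch on made-up data. The names train, test, fit2 and pred are hypothetical stand-ins for PairsTrain, Test, mod2 and PredTest.

```r
# Toy stand-ins for PairsTrain / Test; all values are made up.
set.seed(42)
train <- data.frame(
  itemID_1         = sample(1:100, 200, replace = TRUE),
  itemID_2         = sample(1:100, 200, replace = TRUE),
  generationMethod = sample(1:3, 200, replace = TRUE)
)
train$isDuplicate <- rbinom(200, 1, 0.4)

# The test data has an id column and no generationMethod, as in the question.
test <- data.frame(
  id       = 0:9,
  itemID_1 = sample(1:100, 10),
  itemID_2 = sample(1:100, 10)
)

# Only the two item ID columns are used, so generationMethod is never required.
fit2 <- glm(isDuplicate ~ itemID_1 + itemID_2, data = train, family = binomial)

# Extra columns in newdata (here: id) are simply ignored.
pred <- predict(fit2, newdata = test, type = "response")
print(length(pred))
```

With type = "response" the result is one predicted probability per row of the new data.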
