Question

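I am using linear regression on a dataset with many categorical variables, each with several categories; one of these variables has up to 45 categories.

I am splitting the data this way: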

## 70% of the sample size
smp_size <- floor(0.7 * nrow(plot_data))
## set the seed to make your partition reproducible
set.seed(888)
train_ind <- sample(seq_len(nrow(plot_data)), size = smp_size)

train <- plot_data[train_ind, ]
test <- plot_data[-train_ind, ]

Then I fit the model like this:

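## name the response directly; with train$dependent_variable on the left,
## the `.` would also pull dependent_variable in as a predictor
linear_model = lm(dependent_variable ~ ., data = train)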

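The problem is that whenever I try to predict on the test set, the test set contains some categories that the training set does not have.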

pred_data = predict(linear_model, newdata = test)

This gives me the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor origin has new levels someCategory1, SomeCategory2

Is there a way to make sure that all categories are present in both the training and test sets, or is there a workaround?

Answer 1

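I ended up removing the test-set observations that have new levels. I know this has its limitations and that the OSR2 (out-of-sample R²) becomes less reliable, but it gets the job done: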

test = na.omit(remove_missing_levels(fit = linear_model, test_data = test))

I found the remove_missing_levels function here.
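The linked function is not reproduced in this post. As a rough illustration of the same idea, here is a minimal base-R sketch that drops test rows whose factor values the model never saw. The helper name `drop_unseen_levels` and the toy data are hypothetical and are not the linked implementation; the sketch only assumes that an `lm` fit stores the training levels in `fit$xlevels`.

```r
# Hypothetical helper, NOT the linked remove_missing_levels function.
# Keeps only the rows of `newdata` whose categorical values were seen in training.
drop_unseen_levels <- function(fit, newdata) {
  keep <- rep(TRUE, nrow(newdata))
  # fit$xlevels is a named list: one entry per factor predictor,
  # holding the levels that lm() saw in the training data
  for (v in names(fit$xlevels)) {
    # rows with NA in a factor predictor are dropped as well
    keep <- keep & as.character(newdata[[v]]) %in% fit$xlevels[[v]]
  }
  # drop the now-unused levels so the factor only lists what is left
  droplevels(newdata[keep, , drop = FALSE])
}

# Toy data, made up for illustration
toy_train <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                        origin = factor(c("a", "a", "b", "b", "c", "c")))
toy_test <- data.frame(y = c(2, 7),
                       origin = factor(c("a", "d")))  # "d" never appears in toy_train

toy_fit <- lm(y ~ origin, data = toy_train)
toy_test_kept <- drop_unseen_levels(toy_fit, toy_test)
toy_pred <- predict(toy_fit, newdata = toy_test_kept)
```

Unlike the call above, this sketch removes the offending rows outright instead of marking them for `na.omit`, and it needs no extra package.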

The remove_missing_levels function requires this library:

install.packages("magrittr");
library(magrittr);
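For the other half of the question (keeping every category in the training set), one possible alternative is to force at least one row per level into the training indices and fill the rest at random. This is only a sketch under stated assumptions: the helper `split_covering_levels` and the toy data are hypothetical. Levels that occur only once will then appear in training but not in the test set.

```r
# Hypothetical helper: random split that forces at least one row of every
# observed level of every categorical column into the training set.
split_covering_levels <- function(data, train_frac = 0.7) {
  is_cat <- vapply(data, function(col) is.factor(col) || is.character(col),
                   logical(1))
  forced <- integer(0)
  for (v in names(data)[is_cat]) {
    # row indices grouped by observed level (empty levels are dropped)
    rows_by_level <- split(seq_len(nrow(data)), data[[v]], drop = TRUE)
    # pick one random row per level; indexing via sample.int avoids the
    # sample() surprise when a level has a single row
    picked <- vapply(rows_by_level,
                     function(idx) idx[sample.int(length(idx), 1)],
                     integer(1))
    forced <- union(forced, picked)
  }
  # aim for the requested fraction, but never fewer rows than were forced in
  target <- max(floor(train_frac * nrow(data)), length(forced))
  rest <- setdiff(seq_len(nrow(data)), forced)
  extra <- rest[sample.int(length(rest), target - length(forced))]
  sort(c(forced, extra))
}

# Toy data, made up for illustration: three rare levels
set.seed(888)
toy_data <- data.frame(y = rnorm(20),
                       origin = factor(c(rep("a", 17), "b", "c", "d")))
toy_train_ind <- split_covering_levels(toy_data)
toy_train <- toy_data[toy_train_ind, ]
toy_test <- toy_data[-toy_train_ind, ]

# every level is known to the model, so predict() has no new levels to reject
toy_fit <- lm(y ~ origin, data = toy_train)
toy_pred <- predict(toy_fit, newdata = toy_test)
```

The split is no longer a purely random sample, so out-of-sample metrics computed from it should be read with that in mind.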
