Question

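I have two datasets: train (which contains the variables date, store, item, sales) and test (which has id, date, store, item). I have merged them into one dataset, df_all, and then partitioned it, because I ultimately want to create a forecast for sales using the train dataset.

The structure of df_all is:

    'data.frame':   958000 obs. of  5 variables:
     $ date : Factor w/ 1916 levels "2013-01-01","2013-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ store: int  1 1 1 1 1 1 1 1 1 1 ...
     $ item : int  1 1 1 1 1 1 1 1 1 1 ...
     $ sales: num  13 11 14 13 10 12 10 9 12 9 ...
     $ id   : Factor w/ 45001 levels "0","1","10","100",..: 45001 45001 45001 45001 45001 45001 45001 45001 45001 45001 ...

I partitioned the data with:

    set.seed(1234)
    n = nrow(df_all)
    index = sample(1:n, size = round(0.7*n), replace=T)
    train = df_all[index, ]
    test = df_all[-index, ]

and then used one-hot encoding, because id is a categorical variable:

    trainm <- sparse.model.matrix(sales ~ ., data= train)[,-1]

Except this is where I run into problems: the resulting matrix does not look like the sparse matrix I need, and something strange happens to the columns, which do not appear the way they should (i.e. date, id, store, item, sales). If anyone has any suggestions on how to fix this, or knows another way to go about it, it would be greatly appreciated!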

Answer 1

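We should always do the one-hot encoding before splitting the main dataset. The reason is that the test set can sometimes contain values that do not occur in the training set. If each part is encoded separately, the two matrices end up with different columns, and you will not be able to train the model on one and predict on the other.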
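The mismatch described above can be reproduced on a small scale. The following is a minimal sketch using made-up toy data (the data frame `toy` and its values are assumptions for illustration, not the asker's data). The `item` column is kept as character so that each subset derives its own levels when encoded.

```r
# Toy data (hypothetical): item "c" occurs only in the rows used as test
toy <- data.frame(
  item  = c("a", "b", "a", "b", "c", "a"),
  sales = c(13, 11, 14, 13, 10, 12),
  stringsAsFactors = FALSE
)
train_toy <- toy[1:4, ]
test_toy  <- toy[5:6, ]

# Encoding AFTER the split: each subset gets its own set of dummy columns
m_train_late <- model.matrix(~ item - 1, data = train_toy)
m_test_late  <- model.matrix(~ item - 1, data = test_toy)
print(colnames(m_train_late))
print(colnames(m_test_late))

# Encoding BEFORE the split: both parts share the same columns
m_all         <- model.matrix(~ item - 1, data = toy)
m_train_early <- m_all[1:4, ]
m_test_early  <- m_all[5:6, ]
print(identical(colnames(m_train_early), colnames(m_test_early)))
```

When a column is already a factor, base R subsetting keeps all of its levels, so the problem mostly shows up with character columns or after calling `droplevels()`. Encoding before the split avoids having to think about either case.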

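So, you should do this:

    # One-hot encode every column except id and date
    # (negative indexing by name does not work on a data.frame, so select with %in%)
    df_ohe <- model.matrix(~ . - 1, data = df_all[, !(names(df_all) %in% c("id", "date"))])

    # Join the id column with the one-hot encoded columns
    df_all_new <- cbind(df_all["id"], df_ohe)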

Now you can split the data into train and test.
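As a rough sketch of the whole sequence (encode first, then split), the example below uses a small made-up data frame `df_toy` in place of `df_all`; its values are assumptions for illustration only. It relies on `sparse.model.matrix` from the Matrix package, as in the question. Two choices here are also assumptions rather than part of the original answer: `store` and `item` are converted to factors (integer columns are otherwise treated as plain numeric predictors and not expanded), and the row index is sampled without replacement.

```r
library(Matrix)

# Hypothetical stand-in for df_all (same column names, far fewer rows)
df_toy <- data.frame(
  store = rep(c(1L, 2L), each = 10),
  item  = rep(rep(c(1L, 2L), each = 5), times = 2),
  sales = c(13, 11, 14, 13, 10, 12, 10, 9, 12, 9,
            11, 12, 8, 15, 10, 9, 14, 13, 12, 11)
)

# Convert the integer codes to factors so they are expanded into dummy columns
df_toy$store <- factor(df_toy$store)
df_toy$item  <- factor(df_toy$item)

# Encode all rows at once; sales is the response, and [, -1] drops the intercept
x_all <- sparse.model.matrix(sales ~ store + item, data = df_toy)[, -1]

# Split the encoded matrix afterwards, using a single shared row index
set.seed(1234)
n     <- nrow(df_toy)
index <- sample(seq_len(n), size = round(0.7 * n))  # without replacement (assumption)

x_train <- x_all[index, ]
x_test  <- x_all[-index, ]
y_train <- df_toy$sales[index]
```

In the question's data, `date` and `id` are factors with 1916 and 45001 levels, so a formula such as `sales ~ .` expands each of them into roughly one column per level. That is a likely reason the resulting matrix has far more columns than the five original variables, and it is why the sketch above leaves those two columns out of the formula.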

Missing columns when making a sparse matrix (R)
