Question

I have a dataset (~14410 rows) of observations that include the country. I split it into a train and a test set and fit a decision tree with the rpart() function. When predicting, I sometimes get an error saying that the test set contains countries which are not in the train set.

At first I excluded/deleted the countries which appeared only once:

# Get orderland with frequency one
var.names <- names(table(mydata1$country))[table(mydata1$country) == 1]
loss <- match(var.names, mydata1$country)
names(which(table(mydata1$country) == 1))
mydata1 <- mydata1[-loss, ]

When I rerun my code, I get the same error at the same line, saying that test contains new countries which are not in train. So I counted how often each country appears.

  count <- as.data.frame(count(mydata1, vars=mydata1$country))
  count[rev(order(count$n)),]

                     vars    n
3  Bundesrep. Deutschland 7616
9         Grossbritannien 1436
12                Italien  930
2                 Belgien  731
22               Schweden  611
23                Schweiz  590
13                  Japan  587
19            Oesterreich  449
17            Niederlande  354
8              Frankreich  276
18               Norwegen  238
7                Finnland  130
21               Portugal  105
5               Daenemark   65
26                Spanien   57
4                   China   55
20                  Polen   51
27                 Taiwan   31
14              Korea Süd   30
11                 Irland   26
29             Tschechien   13
16                Litauen    9
10              Hong Kong    7
30                   <NA>    3
6                 Estland    3
24                Serbien    2
1              Australien    2
28               Thailand    1
25               Singapur    1
15               Kroatien    1

From this I can see that I also have NAs in my data.

My question now is: how should I proceed? Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I take the rows for those countries and repeat them two times, so that my predict() function will always work, also for other data sets? Just deleting the rows is somehow not "fancy". Is there any other possibility?

Answer 1

You need to convert every chr (character) variable to a factor:

mydata1$country <- as.factor(mydata1$country)

Then you can simply do the train/test split. You don't need to delete anything (except the NAs).
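A minimal sketch of that workflow, assuming a small made-up data frame in place of the real mydata1 (the column y, the 70/30 ratio and the seed are illustrative assumptions, not taken from the question):

```r
# Hypothetical toy data standing in for the real mydata1
set.seed(1)
mydata1 <- data.frame(
  country = c(rep("Deutschland", 6), rep("Italien", 3), "Kroatien", NA),
  y = 1:11,
  stringsAsFactors = FALSE
)

# Drop the rows where country is NA, then convert to factor BEFORE splitting
mydata1 <- mydata1[!is.na(mydata1$country), ]
mydata1$country <- as.factor(mydata1$country)

# Random 70/30 train/test split (the ratio is an assumption)
train_idx <- sample(nrow(mydata1), size = floor(0.7 * nrow(mydata1)))
train <- mydata1[train_idx, ]
test <- mydata1[-train_idx, ]

# Row subsetting keeps the factor levels, so both parts share one level set
identical(levels(train$country), levels(test$country))
```

The order matters here: the conversion happens once on the full data, so the split cannot produce a test set whose levels are unknown to the train set.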

By using the type factor, your model will know that the variable country has a fixed set of possible levels:

Example:

country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country

[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you

Compare that with the following:

country <- "Italy"
country

[1] "Italy"

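By using a factor, the model will know all the possible levels. So even if the train data contains no observation "Italy", the model will know that it can be observed in the test data.

factor is always the right type for character variables in a model.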
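To make the difference concrete, here is a small base-R sketch (the data and the fixed row split are made up for illustration) that compares converting after the split with converting before it:

```r
# Hypothetical toy data: "Italy" occurs once and ends up in the test rows
dat <- data.frame(country = c("USA", "UK", "USA", "UK", "Italy"),
                  stringsAsFactors = FALSE)
train_rows <- 1:4
test_rows <- 5

# Converting AFTER the split: each part only knows its own values
train_late <- factor(dat$country[train_rows])
test_late <- factor(dat$country[test_rows])
levels(train_late)
setdiff(levels(test_late), levels(train_late))    # "Italy" is new to train

# Converting BEFORE the split: both parts share the full level set
country <- factor(dat$country)
train_early <- country[train_rows]
test_early <- country[test_rows]
levels(train_early)
table(train_early)                                # Italy has count 0 but is a known level
setdiff(levels(test_early), levels(train_early))  # character(0)
```

In the second variant the train factor already lists "Italy" as a level with zero observations, which should be what lets predict() accept it later instead of reporting a new level.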

Observations with low frequency all go in the train set and produce an error in predict()