Low-frequency observations all end up in the train set and cause an error in predict()

Date: 2018-10-05 09:17:14

Tags: r decision-tree predict training-data

I have a dataset (~14410 rows) whose observations include a country. I split it into a train set and a test set and fit a decision tree with the rpart() function. When predicting, I sometimes get an error saying that the test set contains countries which are not in the train set.

At first I excluded/deleted the countries which appeared only once:

# Get the countries that occur exactly once
country.counts <- table(mydata1$country)
var.names <- names(country.counts)[country.counts == 1]
# Drop those rows; %in% stays safe even when var.names is empty
mydata1 <- mydata1[!(mydata1$country %in% var.names), ]

When I rerun my code, I get the same error at the same line, saying that the test set has new countries which are not in the train set. I then counted how often each country appears.

  count <- as.data.frame(count(mydata1, vars=mydata1$country))

                     vars    n
3  Bundesrep. Deutschland 7616
9         Grossbritannien 1436
12                Italien  930
2                 Belgien  731
22               Schweden  611
23                Schweiz  590
13                  Japan  587
19            Oesterreich  449
17            Niederlande  354
8              Frankreich  276
18               Norwegen  238
7                Finnland  130
21               Portugal  105
5               Daenemark   65
26                Spanien   57
4                   China   55
20                  Polen   51
27                 Taiwan   31
14              Korea Süd   30
11                 Irland   26
29             Tschechien   13
16                Litauen    9
10              Hong Kong    7
30                   <NA>    3
6                 Estland    3
24                Serbien    2
1              Australien    2
28               Thailand    1
25               Singapur    1
15               Kroatien    1

From this I can see that I also have NAs in my data.

My question now is: how should I proceed? Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I take the rows of countries with fewer than 7 observations and repeat them two times, so that my predict() call will always work, also for other data sets? Simply deleting the rows is somehow not "fancy". Is there any other possibility?

1 answer:

Answer 0 (score: 0)


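mydata1$country <- as.factor(mydata1$country)

# A factor stores its full set of levels, including levels not present in the data
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you

# A plain character vector carries no level information
country <- "Italy"
country
[1] "Italy"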
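
As a rough illustration of the alternative to deleting rows, the sketch below lumps rare countries into one level and converts to a factor before splitting, so that train and test share the same level set. Everything here is an assumption for demonstration: the toy data frame `toy`, the threshold `min.n` (borrowed from the "< 7" mentioned in the question), the level name "Other", and the choice to treat NA as "Other" as well. It is not taken from the original data or code.

```r
# Toy data standing in for mydata1 (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  country = c(rep("Deutschland", 50), rep("Italien", 20),
              "Thailand", "Singapur", NA, NA),
  y = rnorm(74),
  stringsAsFactors = FALSE
)

min.n <- 7  # hypothetical threshold for "rare"

# Count each country (table() ignores NA by default) and find the rare ones
counts <- table(toy$country)
rare <- names(counts)[counts < min.n]

# Lump rare countries and missing values into a single "Other" value (assumption)
toy$country[is.na(toy$country) | toy$country %in% rare] <- "Other"

# Convert to factor BEFORE splitting so both subsets share the same levels
toy$country <- factor(toy$country)

# Random 70/30 split
idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[idx, ]
test <- toy[-idx, ]

# Subsetting a factor keeps all of its levels, so the two level sets agree
same.levels <- identical(levels(train$country), levels(test$country))
```

With the levels fixed before the split, `train` and `test` can then be used for fitting and predicting as in the question. Whether NA should be folded into "Other" or handled separately depends on the data, so that step in particular is only one possible choice.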


