I am trying to predict whether an item from an online retailer will be returned, but I fail to transform my unknown data set (the one I want to predict on) so that it matches the training data of my model. More specifically, I calculate the weight of evidence (WoE) on a split of my training data, and when I try to replace the IDs in the unknown data with their WoE values I receive an error. Let me go through this step by step.
The known dataset (which will be split into test and training sets) is daten. The relevant columns are the ID columns brand_id, item_id, user_id and the binary dependent variable "return" (1 = item will be returned, 0 = customer will not return the item). All ID columns are factors with more than 200 levels.
In the following, the original data is prepped first. Then the WoE model is trained, and finally the unknown data is prepped and the WoE model is applied to it.
Prepping the ID variables: every ID that occurs only once is assigned to the factor level "New". daten includes all my observations.
# ---- Item_ID ---- (same for brand_id & user_id)
levels(daten$item_id) <- c(levels(factor(daten$item_id)),"New")
daten$item_id[daten$item_id %in% names(table(daten$item_id))[table(daten$item_id) == 1]] <- factor("New")
daten$item_id <- factor(daten$item_id)
Now I split daten into a test and a training set. The training set is split once more to obtain a subset that is used only to calculate the weight of evidence.
library(caret) # createDataPartition()
library(klaR)  # woe(), predict.woe()
# ---- Training & Test ----
set.seed(111)
idx.train <- createDataPartition(y = daten$return, p = 0.75, list = FALSE)
test <- daten[-idx.train, ] # test set
train <- daten[idx.train, ] # training set
set.seed(112)
woe.idx.train <- createDataPartition(y=train$return, p = 0.7, list = FALSE)
train.split <- train[woe.idx.train,]
Now I train the woe model.
woe.values_ids <- woe(return ~ item_id+brand_id+user_id, data=train.split, zeroadj=0.05)
Next, I transform the training and test sets so that the IDs are replaced by their respective WoE values. (The predict function is predict.woe from the klaR package.)
test.2 <- predict(woe.values_ids, newdata = test, replace = TRUE)
train.2 <- predict(woe.values_ids, newdata = train, replace = TRUE)
Now we move on to the unknown dataset, which is called nd (= new data). nd has the columns brand_id, item_id, and user_id (also all factors), but no "return" column. I start by prepping the IDs so that IDs which are new and were not used for the WoE calculation are assigned to the factor level "New" (which exists in the training data as well). Here is the code for item_id only:
levels(nd$item_id) <- c(levels(factor(nd$item_id)),"New")
nd$item_id[!(nd$item_id %in% woe.values_ids$xlevels$item_id)] <- factor("New")
nd$item_id <- factor(nd$item_id, levels = levels(train.split$item_id))
In the last step I want to calculate the WoE for nd based on woe.values_ids (which was trained on part of the training set), but I always receive errors saying that the levels of the IDs don't match or, after some changes, the following:
final <- predict(woe.values_ids, newdata=nd, replace = TRUE)
Error in if (sum(sapply(unique(x.vec), function(x) return(sum(x ==
unique(names(woe.obj))) == :
missing value where TRUE/FALSE needed
In summary, my understanding is that the WoE model is built on the basis of all factor levels in the train.split set. When I want to apply it to a new set, no new factor levels are allowed in that set. By assigning all new IDs to the level "New", which exists in the train.split dataset, I want to solve that problem (see the prepping of nd above).
Nonetheless, it doesn't work. Is it maybe because I don't have a return column in my new data set? As I understand it, though, this column shouldn't be relevant when only applying the known WoE model to new data.
Answer 0 (score: 0)
As the error "missing value where TRUE/FALSE needed" says, I had some NAs in my data which the function wasn't able to process.
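For what it's worth, this message is not specific to klaR: it is what R's if() raises whenever its condition evaluates to NA. Here is a minimal base-R sketch of that mechanism. The toy vector is made up for illustration and is not taken from the data in the question.

```r
# Toy illustration (made-up values): a comparison involving NA yields NA,
# and if() cannot branch on an NA condition.
x <- c("a", NA, "b")

# x == "a" gives TRUE, NA, FALSE; sum() over that is NA; NA == 1 is NA.
cond <- sum(x == "a") == 1

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    if (cond) "exactly one match" else "no single match"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

In an English locale the printed message should be the same one as in the question.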
So I searched all my objects for NAs and finally found some in user_id. I then checked how user_id was filled and made some small corrections, which led to a proper solution!
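A check along the following lines can locate such NAs. This is only a sketch on a made-up data frame (nd_toy) standing in for nd. The column names follow the question, while the values and the train_levels vector are invented. It also shows one common way NAs can appear in this kind of workflow: re-creating a factor with a fixed set of levels turns every value outside those levels into NA.

```r
# Made-up stand-in for the new data set; all values are invented.
nd_toy <- data.frame(
  item_id  = factor(c("i1", "i2", "New")),
  brand_id = factor(c("b1", "b1", "b2")),
  user_id  = factor(c("u1", "u9", "u2"))
)

# Hypothetical training levels; "u9" was never seen during training.
train_levels <- c("u1", "u2", "New")

# Values outside the given levels silently become NA.
nd_toy$user_id <- factor(nd_toy$user_id, levels = train_levels)

# Count NAs per column to find the affected variable.
na_counts <- colSums(is.na(nd_toy))
print(na_counts)

# Map the NAs to the catch-all level before calling predict().
nd_toy$user_id[is.na(nd_toy$user_id)] <- "New"
```

Running the same per-column count on the real nd right before the predict() call is a cheap way to see which ID column still contains NAs.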