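I am trying to build predictive models from a customer database.
I have a dataset of 3,000 customers. In the test dataset, each customer has 300 observations and 20 variables (including the dependent variable). I also have a scoring dataset with 50 observations per unique customer ID and 19 variables (excluding the dependent variable). The test dataset is in a separate file in which each customer is identified by a unique ID variable, and the scoring dataset is identified by a unique ID variable in the same way.
I am developing a RandomForest-based predictive model. Below is an example for a single customer. I am not sure how to apply the model to every customer automatically, and how to predict and store the models efficiently.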
install.packages("randomForest")
library(randomForest)
sales <- read.csv("C:/rdata/test.csv", header=T)
sales_score <- read.csv("C:/rdata/score.csv", header=T)
## RandomForest for Single customer
sales.rf <- randomForest(Sales ~ ., ntree = 500, data = sales,importance=TRUE)
sales.rf.test <- predict(sales.rf, sales_score)
I am very familiar with SAS and have just started learning R. For SAS programmers, many SAS procedures support BY-group processing, for example:
proc gam data = test;
by id;
model y = x1 x2 x3;
score data = test out = pred;
run;
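This SAS program fits a gam model for each unique ID and applies it to the test set for that ID. Is there an R equivalent?
I would greatly appreciate any examples or ideas.
Many thanks.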
Answer 0 (score: 2)
Assuming your sales dataset has 3,000 * 300 = 900,000 rows and both data frames have a customer_id column, you can do something like the following:
pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
# pred_groups is now a list, with names the customer_id's and each list
# element an integer vector of row numbers. Now iterate over each customer
# and make predictions on the scoring set.
preds <- unsplit(structure(lapply(names(pred_groups), function(customer_id) {
  # Train using only observations for this customer.
  # Note we are comparing character to integer but R's natural type
  # coercion should still give the correct answer.
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ], importance = TRUE)
  # Now make predictions only for this customer.
  predict(sales.rf, sales_score[pred_groups[[customer_id]], ])
}), .Names = names(pred_groups)), sales_score$customer_id)
print(head(preds)) # Should now be a vector of predicted scores of length
                   # the number of rows in the scoring set.
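To see why the split/unsplit pattern above returns predictions in the original row order, here is a minimal base-R sketch on toy data. The vector `ids` and the "multiply by 10" step are made up for illustration and simply stand in for the customer IDs and the per-customer model.

```r
# Toy stand-in for sales_score$customer_id (hypothetical values).
ids <- c("b", "a", "b", "a", "c")

# Row numbers grouped by id: a -> 2, 4; b -> 1, 3; c -> 5.
groups <- split(seq_along(ids), ids)

# Stand-in for "fit a model and predict": one value per row in the group.
per_group <- lapply(groups, function(rows) rows * 10)

# unsplit() puts each group's values back at the rows they came from.
restored <- unsplit(per_group, ids)
print(restored)
```

Each value lands back in the position of the row that produced it, regardless of how the IDs are interleaved in the input.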
Edit: per @joran, here is a solution using a for loop:
pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
preds <- numeric(nrow(sales_score))
for (customer_id in names(pred_groups)) {
  # Train using only observations for this customer.
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ], importance = TRUE)
  # Write this customer's predictions into their original row positions.
  pred_rows <- pred_groups[[customer_id]]
  preds[pred_rows] <- predict(sales.rf, sales_score[pred_rows, ])
}
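The loops above discard each fitted model after predicting. If you also want to keep the models, as the question asks, one option is to store them in a list named by customer ID. The following is only a sketch under assumptions: the data are simulated, the helper names `make_rows` and `fit_one` are hypothetical, and `lm` is used as a stand-in so the example runs in base R. In principle you could swap in `randomForest(Sales ~ ., ...)` inside `fit_one`, keeping in mind that holding thousands of forests in memory may be expensive.

```r
# Simulated data: 3 hypothetical customers (not the asker's real data).
set.seed(1)
make_rows <- function(n_per_id, with_response) {
  d <- data.frame(customer_id = rep(1:3, each = n_per_id),
                  x1 = rnorm(3 * n_per_id),
                  x2 = rnorm(3 * n_per_id))
  if (with_response) d$Sales <- 2 * d$x1 - d$x2 + rnorm(nrow(d), sd = 0.1)
  d
}
train <- make_rows(20, TRUE)   # 20 training rows per customer
score <- make_rows(5, FALSE)   # 5 scoring rows per customer

# Fit one model per customer. The id column is constant within a group,
# so it is dropped before fitting. lm is a stand-in for randomForest.
fit_one <- function(d) lm(Sales ~ ., data = d[, names(d) != "customer_id"])
models <- lapply(split(train, train$customer_id), fit_one)

# Predict each customer's scoring rows with that customer's stored model.
score_groups <- split(score, score$customer_id)
pred_list <- lapply(names(score_groups), function(id) {
  unname(predict(models[[id]], newdata = score_groups[[id]]))
})

# Restore the original row order of the scoring data.
preds <- unsplit(pred_list, score$customer_id)
```

The `models` list can then be written to disk with `saveRDS(models, "models.rds")` (the file name is just an example) and reloaded later with `readRDS`, so a customer's model can be reused without refitting.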