Random forest: by-group processing and scoring

Time: 2014-03-23 01:19:44

Tags: r sas grouping prediction random-forest

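I am trying to build a predictive model from a customer database.

I have a dataset with 3,000 customers. In the test dataset, each customer has 300 observations and 20 variables (including the dependent variable). I also have a score dataset with 50 observations per unique customer ID and 19 variables (excluding the dependent variable). The test dataset is in a separate file, with each customer identified by a unique ID variable; the score dataset is likewise identified by a unique ID variable.

I am developing a predictive model based on RandomForest. Below is an example for a single customer. I am not sure how to apply the model to every customer automatically, and how to predict and store the models efficiently.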

    install.packages("randomForest")  # the package name must be quoted
    library(randomForest)
    sales <- read.csv("C:/rdata/test.csv", header = TRUE)
    sales_score <- read.csv("C:/rdata/score.csv", header = TRUE)

    ## RandomForest for a single customer

    sales.rf <- randomForest(Sales ~ ., ntree = 500, data = sales, importance = TRUE)
    sales.rf.test <- predict(sales.rf, sales_score)

I am very familiar with SAS and have just started learning R. For SAS programmers, many SAS procedures support by-group processing, for example:

    proc gam data = test;
    by id;
    model y = x1  x2 x3;
    score data = test  out = pred;
    run;

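This SAS program fits a GAM model for each unique ID and applies each model to the test set for that ID. Is there an R equivalent?

I would greatly appreciate any examples or ideas.

Many thanks.

1 answer:

Answer 0: (score: 2)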

Assuming your sales dataset has 3,000 * 300 = 900,000 rows and both data frames have a customer_id column, you can do something like this:

    pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
    # pred_groups is now a list whose names are the customer_ids and whose
    # elements are integer vectors of row numbers. Now iterate over each
    # customer and make predictions on that customer's rows of the score set.
    preds <- unsplit(structure(lapply(names(pred_groups), function(customer_id) {
      # Train using only observations for this customer.
      # Note we are comparing character to integer, but R's type
      # coercion should still give the correct answer.
      train_rows <- sales$customer_id == customer_id
      sales.rf <- randomForest(Sales ~ ., ntree = 500,
                               data = sales[train_rows, ], importance = TRUE)

      # Now make predictions only for this customer.
      predict(sales.rf, sales_score[pred_groups[[customer_id]], ])
    }), .Names = names(pred_groups)), sales_score$customer_id)

    # Should now be a vector of predicted values whose length is
    # the number of rows in the score set.
    print(head(preds))
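
The step that is easiest to get wrong above is the split()/unsplit() round trip, so here is a minimal sketch of just that part on toy data. The data frame `score` and the "prediction" (simply `x` scaled by the customer id) are made up for illustration and stand in for `sales_score` and the random forest output; only base R is used.

```r
# Toy score data: rows for three hypothetical customers, deliberately interleaved.
score <- data.frame(customer_id = c(3, 1, 2, 1, 3, 2),
                    x = c(10, 20, 30, 40, 50, 60))

# Row numbers grouped by customer; the list names are the ids as strings.
groups <- split(seq_len(nrow(score)), score$customer_id)

# Stand-in for per-customer predictions: x scaled by the customer id.
per_group <- lapply(names(groups), function(id) {
  score$x[groups[[id]]] * as.numeric(id)
})
names(per_group) <- names(groups)

# unsplit() puts each group's values back at the rows they came from.
restored <- unsplit(per_group, score$customer_id)
print(restored)
```

`restored` lines up row for row with `score`, even though the customers were interleaved, which is why the result above can be attached directly to `sales_score` as a new column.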

Edit: per @joran, here is a solution using a for loop:

    pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
    # Preallocate one prediction slot per row of the score set.
    preds <- numeric(nrow(sales_score))
    for (customer_id in names(pred_groups)) {
      # Train on this customer's rows only.
      train_rows <- sales$customer_id == customer_id
      sales.rf <- randomForest(Sales ~ ., ntree = 500,
                               data = sales[train_rows, ], importance = TRUE)
      # Write this customer's predictions into the matching rows.
      pred_rows <- pred_groups[[customer_id]]
      preds[pred_rows] <- predict(sales.rf, sales_score[pred_rows, ])
    }
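
The question also asks about storing the models, which neither version above does, since each fit is overwritten on the next iteration. A rough sketch of one way to keep them is a list named by customer id. Everything here is an assumption for illustration: the data are synthetic, the column names `x1` and `x2` are invented, and `lm()` is used as a stand-in so the snippet runs with base R only; a `randomForest()` call should slot into the same place.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic training data: 3 hypothetical customers, 30 rows each.
train <- data.frame(customer_id = rep(1:3, each = 30),
                    x1 = rnorm(90), x2 = rnorm(90))
train$Sales <- 2 * train$x1 - train$x2 + rnorm(90, sd = 0.1)

# Synthetic score data: 5 rows per customer, no Sales column.
score <- data.frame(customer_id = rep(1:3, times = 5),
                    x1 = rnorm(15), x2 = rnorm(15))

# One model per customer, kept in a list named by customer id.
# lm() is a stand-in; randomForest(Sales ~ x1 + x2, data = d) could go here.
models <- lapply(split(train, train$customer_id), function(d) {
  lm(Sales ~ x1 + x2, data = d)
})

# Score each customer's rows with that customer's own model.
score$pred <- NA_real_
for (id in names(models)) {
  rows <- which(score$customer_id == id)
  score$pred[rows] <- predict(models[[id]], newdata = score[rows, ])
}

# saveRDS(models, "models.rds")  # hypothetical file name; reload with readRDS()
```

Keeping 3,000 random forests in memory at once may be heavy, so writing each model to disk inside the loop (one file per customer) is an alternative worth considering.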