Implementing R's random forest feature importance score in scikit-learn

Date: 2015-08-20 21:21:30

Tags: python r scikit-learn regression random-forest

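I am trying to implement, in sklearn, the feature importance scoring method that R uses for its random forest regression models. According to R's documentation: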

  

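The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).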

So, if I understand correctly, I need to be able to permute each predictor variable (feature) over the OOB samples of each tree.
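The averaging and normalization step in the quoted description can be sketched with NumPy alone. This is only an illustration under assumptions: the function name `normalized_importance` and the input layout (one row per tree, one column per feature, each entry being permuted OOB MSE minus baseline OOB MSE) are hypothetical, and the use of the sample standard deviation (`ddof=1`) is a guess, since the quoted text does not say which estimator is used.

```python
import numpy as np

def normalized_importance(diffs):
    """Average per-tree error increases and scale by their spread.

    diffs: array-like of shape (n_trees, n_features); each entry is
    (permuted OOB MSE - baseline OOB MSE) for one tree and one feature.
    """
    diffs = np.asarray(diffs, dtype=float)
    mean_diff = diffs.mean(axis=0)
    # Assumption: sample standard deviation; the quoted text does not specify.
    sd_diff = diffs.std(axis=0, ddof=1)
    out = mean_diff.copy()
    # Skip the division where the standard deviation is 0, as the quote says.
    nonzero = sd_diff > 0
    out[nonzero] = mean_diff[nonzero] / sd_diff[nonzero]
    return out

# Hypothetical differences for 3 trees and 2 features.
example = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [3.0, 0.0]])
result = normalized_importance(example)
```

For the first feature the mean difference is 2 and the sample standard deviation is 1, so the score is 2; the second feature has zero spread, so its mean (0) is returned undivided.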

I know I can access each tree in the trained forest with something like this:

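from sklearn.ensemble import RandomForestRegressor

# X, Y: training features and targets, defined elsewhere
numberTrees = 100
clf = RandomForestRegressor(n_estimators=numberTrees)
clf.fit(X, Y)
for tree in clf.estimators_:
    pass  # do something with each fitted tree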

Is there a way to get the list of OOB samples for each tree? Perhaps I could use each tree's random_state to derive the list of OOB samples?
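One way to sidestep the question of recovering OOB rows from a fitted `RandomForestRegressor` is to draw the bootstrap samples by hand, so the OOB rows of every tree are known by construction. The sketch below does that with `DecisionTreeRegressor`; it is a swapped-in bagging loop, not scikit-learn's forest implementation, and the function name `oob_permutation_diffs`, its parameters, and the synthetic data are all assumptions made for illustration. It expects NumPy arrays rather than DataFrames.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def oob_permutation_diffs(X, y, n_trees=50, max_features=None, seed=0):
    """Return an (n_trees, n_features) array of OOB MSE increases."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    diffs = np.zeros((n_trees, n_features))
    for t in range(n_trees):
        # Draw the bootstrap sample ourselves so the OOB rows are known.
        boot = rng.randint(0, n_samples, n_samples)
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0:  # extremely unlikely; leave this tree's row at 0
            continue
        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.randint(2**31 - 1)),
        )
        tree.fit(X[boot], y[boot])
        X_oob, y_oob = X[oob], y[oob]
        base_mse = mean_squared_error(y_oob, tree.predict(X_oob))
        for j in range(n_features):
            # Permute one feature among the OOB rows only.
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            perm_mse = mean_squared_error(y_oob, tree.predict(X_perm))
            diffs[t, j] = perm_mse - base_mse
    return diffs

# Synthetic data (hypothetical): only feature 0 drives the target.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
diffs = oob_permutation_diffs(X_demo, y_demo, max_features=2)
mean_increase = diffs.mean(axis=0)
```

The per-tree differences in `diffs` can then be averaged over trees and normalized as the quoted R documentation describes. Whether the numbers match R's output closely would need to be checked against R itself.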

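1 Answer:

Answer 0: (score: 3)

Although R uses the OOB samples, I found that I get similar results in scikit-learn by using all of the training samples. I am doing the following: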

from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# permute training data and score against its own model
# (trees, num_features, leaf, X_train and y_train are defined elsewhere)
epoch = 3
seeds = range(epoch)

scores = defaultdict(list)  # {feature: relative drop in R^2 per epoch}

# repeat the process several times, then average the score for each feature
for j in range(epoch):
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[j],
                                max_features=num_features, min_samples_leaf=leaf)

    clf = clf.fit(X_train, y_train)
    acc = clf.score(X_train, y_train)  # R^2 on the unpermuted training data

    print('Epoch', j)
    # for each feature, permute its values and check the resulting score
    for i, col in enumerate(X_train.columns):
        if i % 200 == 0:
            print("- feature %s of %s permuted" % (i, X_train.shape[1]))
        X_train_copy = X_train.copy()
        X_train_copy[col] = np.random.permutation(X_train[col])
        shuff_acc = clf.score(X_train_copy, y_train)
        scores[col].append((acc - shuff_acc) / acc)

# get mean across epochs
scores_mean = {k: np.mean(v) for k, v in scores.items()}

# sort scores (best first)
scores_sorted = pd.DataFrame.from_dict(scores_mean, orient='index').sort_values(0, ascending=False)
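As a point of comparison, scikit-learn 0.22 and later ship `sklearn.inspection.permutation_importance`, which runs the same permute-and-rescore loop. The sketch below applies it to a held-out split instead of the training data; that choice, the synthetic data, and the variable names are assumptions for illustration. It is neither the approach used in the answer above nor R's per-tree OOB scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only feature 1 drives the target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 5))
y_demo = 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Each feature is shuffled n_repeats times; the drop in the estimator's
# default score (R^2 for regressors) is recorded for every repeat.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
```

`result.importances_mean` and `result.importances_std` hold the per-feature mean and spread across repeats, which roughly correspond to the averaging over epochs done by hand above.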