Why does my random forest perform worse than a decision tree

Time: 2017-10-24 06:32:13

Tags: scikit-learn random-forest

This is my first attempt at a random forest and, unfortunately, it performs worse than a single decision tree. I have been digging into this but cannot figure out where the problem is. Below are some runs. (Apologies for posting the full code.)

   
Sklearn Decision Tree Classifier 0.714285714286
Sklearn Random Forest Classifier 0.714285714286
My home made Random Forest Classifier 0.628571428571

Sklearn Decision Tree Classifier 0.642857142857
Sklearn Random Forest Classifier 0.814285714286
My home made Random Forest Classifier 0.571428571429

Sklearn Decision Tree Classifier 0.757142857143
Sklearn Random Forest Classifier 0.771428571429
My home made Random Forest Classifier 0.585714285714

I am using the sonar dataset from this (Sonar, Mines vs. Rocks) because it has about 60 features.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# section 1: read data, shuffle, change label from string to float
filename = "sonar_all_data.csv"
colnames = ['c'+str(i) for i in range(60)]
colnames.append('type')
df = pd.read_csv(filename, index_col=None, header=None, names=colnames)
df = df.sample(frac=1).reset_index(drop=True)
df['lbl'] = 1.0
df.loc[df['type']=='R', 'lbl'] = 0.0
df.drop('type', axis=1, inplace=True)
df = df.astype(np.float32)  # astype() returns a new DataFrame; it has no inplace flag
feature_names = ['c' + str(i) for i in range(60)]
label_name = ['lbl']

# section 2: prep train and test data
test_x = df[:70][feature_names].to_numpy()   # .get_values() was removed from pandas
test_y = df[:70][label_name].to_numpy().ravel()
train_x = df[70:][feature_names].to_numpy()
train_y = df[70:][label_name].to_numpy().ravel()

# section 3: take a look at performance of sklearn decision tree and randomforest
clf = DecisionTreeClassifier()
clf.fit(train_x, train_y)
print("Sklearn Decision Tree Classifier", clf.score(test_x, test_y))

rfclf = RandomForestClassifier(n_jobs=2)
rfclf.fit(train_x, train_y)
print("Sklearn Random Forest Classifier", rfclf.score(test_x, test_y))


# section 4: my first practice of random forest
m = 10               # number of trees in the forest
votes = [1 / m] * m  # equal voting weight for every tree
num_train = len(train_x)
num_feat = len(train_x[0])


n = int(num_train * 0.6)    # rows per tree: 60% of the training set
k = int(np.sqrt(num_feat))  # features per tree: sqrt of the feature count

index_of_train_data = np.arange(num_train)
index_of_train_feat = np.arange(num_feat)

clfs = [DecisionTreeClassifier() for _ in range(m)]
feats = []

for xclf in clfs:
    # shuffle-and-slice draws n rows and k features without replacement
    np.random.shuffle(index_of_train_data)
    np.random.shuffle(index_of_train_feat)
    row_idx = index_of_train_data[:n]
    feat_idx = index_of_train_feat[:k]
    sub_train_x = train_x[row_idx, :][:, feat_idx]
    sub_train_y = train_y[row_idx]
    xclf.fit(sub_train_x, sub_train_y)
    feats.append(feat_idx)  # remember which features this tree saw

pred = np.zeros(test_y.shape)

# weighted average of the trees' 0/1 predictions, thresholded at 0.5
for xclf, feat, vote in zip(clfs, feats, votes):
    pred += xclf.predict(test_x[:, feat]) * vote

pred[pred > 0.5] = 1.0
pred[pred <= 0.5] = 0.0
print("My home made Random Forest Classifier", sum(pred == test_y) / len(test_y))

1 Answer:

Answer 0 (score: 1):

As chrisckwong821 said, you are overfitting: if the trees in your random forest grow too deep, they fit the training data too closely and predict new (test) data poorly.
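
A minimal sketch of that fix, assuming the same train_x/train_y/test_x/test_y arrays as in the question; max_depth=5 and min_samples_leaf=3 are illustrative, untuned values:

from sklearn.ensemble import RandomForestClassifier

# Illustrative, untuned hyperparameters: max_depth caps how deep each tree
# can grow, and min_samples_leaf keeps leaves from memorizing single samples.
shallow_rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                    min_samples_leaf=3, n_jobs=2)
shallow_rf.fit(train_x, train_y)
print("Depth-limited Random Forest", shallow_rf.score(test_x, test_y))

The same max_depth and min_samples_leaf keywords can be passed to the DecisionTreeClassifier instances built in section 4 of the question.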