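I am trying to figure out how to interpret the trees from my random forest. My data contains about 29,000 observations and 35 features. I have pasted the first 22 observations, the first 11 features, and the feature I am trying to predict (HighLowMobility).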
import numpy as np
import pandas as pd

#loading the data into data frame
X = pd.read_csv('raw_data_for_edits.csv')
#Impute the missing values with median values (numeric columns only)
X = X.fillna(X.median(numeric_only=True))
#Dropping the categorical values
X = X.drop(['county_name','statename','stateabbrv'],axis=1)
#Collect the output in y variable
y = X['HighLowMobility']
X = X.drop(['HighLowMobility'],axis=1)
from sklearn.preprocessing import LabelEncoder
#Encoding the output labels
def preprocess_labels(y):
    yp = []
    #low = 0
    #high = 0
    for i in range(len(y)):
        if (str(y[i]) =='Low'):
            yp.append(0)
            #low +=1
        elif (str(y[i]) =='High'):
            yp.append(1)
            #high +=1
        else:
            yp.append(1)
    return yp
#y = LabelEncoder().fit_transform(y)
yp = preprocess_labels(y)
yp = np.array(yp)
yp.shape
X.shape
#sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X,yp,test_size=0.25, random_state=42)
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)
training_data = X_train,y_train
test_data = X_test,y_test
dims = X_train.shape[1]
if __name__ == '__main__':
    #Neural_Network is a custom class whose definition is not shown here
    nn = Neural_Network([dims,10,5,1], learning_rate=1, C=1, opt=False, check_gradients=True, batch_size=200, epochs=100)
    nn.fit(X_train,y_train)
    weights = nn.final_weights()
    testlabels_out = nn.predict(X_test)
    print(testlabels_out)
    print("Neural Net Accuracy is " + str(np.round(nn.score(X_test,y_test),2)))
'''
RANDOM FOREST AND LOGISTIC REGRESSION
'''
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
clf1 = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
#min_samples_split must be at least 2 in current scikit-learn
clf2 = RandomForestClassifier(n_estimators=100, max_depth=None,min_samples_split=2, random_state=0)
for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Here is my random forest:
{{1}}
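How do I interpret my trees? For example, perm_res_p25_c1823 is a feature giving the college attendance rate at ages 18-23 for children born at the 25th percentile, perm_res_p75_c1823 is the same for the 75th percentile, and the HighLowMobility feature states whether upward income mobility is High or Low. So how would I show something like the following: "If the person comes from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?
Answer 0 (score: 2)
You cannot interpret an RF in such terms, because random forests do not work this way. An RF builds an ensemble of highly randomized trees, which can contain all kinds of decision rules. Once you move from a fully interpretable decision tree to an RF, you lose this aspect of the classifier. An RF is a black box. You can compute many different approximations and estimates, but they effectively ignore or replace your RF.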
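To illustrate what such approximations can look like, here is a minimal sketch. It assumes scikit-learn, synthetic data from `make_classification`, and hypothetical feature names `f0` to `f4`, not the mobility dataset from the question. It reads the forest's impurity-based `feature_importances_` and then fits a shallow surrogate `DecisionTreeClassifier` to the forest's own predictions. The printed if/then rules describe the surrogate tree, not the forest itself, which is exactly the "replace your RF" caveat.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 500 rows, 5 numeric features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# Estimate 1: impurity-based importances, one value per feature.
importances = dict(zip(feature_names, rf.feature_importances_))

# Estimate 2: a shallow surrogate tree trained to mimic the forest's
# predictions (not the true labels).
rf_pred = rf.predict(X_demo)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_demo, rf_pred)

# Fidelity: fraction of rows where the surrogate agrees with the forest.
fidelity = (surrogate.predict(X_demo) == rf_pred).mean()

# Human-readable if/then rules of the surrogate tree.
rules = export_text(surrogate, feature_names=feature_names)

print(importances)
print("surrogate fidelity: %.2f" % fidelity)
print(rules)
```

A low fidelity value would mean the surrogate's rules say little about what the forest actually does, so any rule read from it should be treated as a rough summary at best.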