How to interpret the trees in a random forest with Python

Time: 2016-06-27 00:40:33

Tags: python machine-learning random-forest

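I'm trying to figure out how to interpret the trees in my random forest. My data has roughly 29,000 observations and 35 features. I've pasted the first 22 observations, the first 11 features, and the feature I'm trying to predict (HighLowMobility).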

    import numpy as np
    import pandas as pd

    # Load the data into a data frame
    X = pd.read_csv('raw_data_for_edits.csv')
    # Impute missing values with the column medians
    # (numeric_only=True skips the text columns, which newer pandas requires)
    X = X.fillna(X.median(numeric_only=True))

    # Drop the categorical columns
    X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)

    # Collect the target in y, then remove it from the features
    y = X['HighLowMobility']
    X = X.drop(['HighLowMobility'], axis=1)


    from sklearn.preprocessing import LabelEncoder

    # Encode the output labels: 'Low' -> 0, anything else (including 'High') -> 1
    def preprocess_labels(y):
        yp = []
        for label in y:
            if str(label) == 'Low':
                yp.append(0)
            else:
                yp.append(1)
        # Return only after every label has been processed
        return yp



    #y = LabelEncoder().fit_transform(y)
    yp = preprocess_labels(y)
    yp = np.array(yp)
    yp.shape
    X.shape
    # sklearn.cross_validation was removed; train_test_split now lives in model_selection
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, yp, test_size=0.25, random_state=42)
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    training_data = X_train, y_train
    test_data = X_test, y_test
    dims = X_train.shape[1]
    if __name__ == '__main__':
        # Neural_Network is my own class; its definition is not included in this post
        nn = Neural_Network([dims, 10, 5, 1], learning_rate=1, C=1, opt=False, check_gradients=True, batch_size=200, epochs=100)
        nn.fit(X_train, y_train)
        weights = nn.final_weights()
        testlabels_out = nn.predict(X_test)
        print(testlabels_out)
        print("Neural Net Accuracy is " + str(np.round(nn.score(X_test, y_test), 2)))


    # RANDOM FOREST AND LOGISTIC REGRESSION
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    clf1 = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
    # min_samples_split was 1 originally; current scikit-learn requires an integer of at least 2
    clf2 = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
    for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Here is my random forest:

{{1}}

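How do I interpret my trees? For example, perm_res_p25_c1823 is a feature giving college attendance at ages 18-23 for children born at the 25th percentile, perm_res_p75_c1823 is the same for the 75th percentile, and the HighLowMobility feature states whether upward income mobility is high or low. So how would I show something like the following: "If the person came from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?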

1 answer:

Answer 0: (score: 2)

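You cannot interpret an RF in those terms, because random forests do not work that way. An RF builds an ensemble of highly randomized trees, each of which can have very different decision rules. Once you move from a single, fully interpretable decision tree to an RF, you lose that aspect of the classifier. An RF is a black box. You can compute many different approximations and estimates, but they effectively ignore or replace your RF.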
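To make the "approximations" point concrete, here is a minimal sketch of three common ones using scikit-learn. It is not the asker's pipeline: the data is synthetic (`make_classification`), the feature names `f0` to `f4` are hypothetical, and the forest is kept small and shallow so the printed rules stay readable. It shows (1) the rules of one individual tree, which the forest's overall vote may contradict, (2) impurity-based feature importances, and (3) a shallow surrogate tree trained to mimic the forest's predictions. The surrogate's rules describe the surrogate, not the forest itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the real data; feature names are hypothetical.
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X, y)

# 1) Rules of ONE tree out of 50. Readable, but not the forest's decision.
first_tree_rules = export_text(forest.estimators_[0], feature_names=feature_names)
print(first_tree_rules)

# 2) Impurity-based importances: which features the forest splits on most.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")

# 3) Surrogate: a shallow tree fitted to the forest's predictions, not to y.
forest_pred = forest.predict(X)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, forest_pred)

# Fidelity: how often the surrogate reproduces the forest on the training rows.
fidelity = (surrogate.predict(X) == forest_pred).mean()
print(export_text(surrogate, feature_names=feature_names))
print(f"Surrogate agrees with the forest on {fidelity:.0%} of training rows")
```

The fidelity number tells you how far to trust the surrogate's if/then rules as a description of the forest; when it is low, the readable rules are explaining a different model.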