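I am using the scikit-learn decision tree classifier to predict the "effort" of a migration project. The other part of my requirement is to find the features that influence the prediction.
I trained the model and got a hierarchical tree with all the features at different nodes.
I assumed that when I supply a test record, the same tree would be used to predict the size. But that does not seem to be the case!
After predicting, I printed the decision_path to see which features were considered in that prediction.
This decision path is completely different from the tree the model built.
If the tree is not used for prediction, what is the tree for?
How can I use the decision path to get the important features in that prediction?
If I export the rule set and use it to find the decision path, it gives me the wrong features, or does not match the output of decision_path.
Edit 1
Added generic code. It gives a similar output.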
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
# Create tree object
import graphviz
import pydotplus
import collections
file_path = "sample_data_generic.csv"
data = pd.read_csv( file_path )
data.head()
df = data.copy()
cols = df.columns
col_len = len(cols)
features_category = []
for col_index in range( col_len ):
    if df[ cols[col_index] ].dtype == 'object' or df[ cols[col_index] ].dtype == 'float64':
        df[ cols[col_index] ] = df[ cols[col_index] ].astype('category')
        features_category.append( cols[col_index] )
#redefining the variable value as it is throwing some error in the below lines due to the presence of next line char?!
features_category = ['Cloud Provider', 'OS Upgrade Path', 'Target_OS_NAME', 'Target_OS_VERSION', 'os_version']
# create dataframe for target variable
df_target = df['Size']
df.drop('Size', axis=1, inplace=True)
df = pd.get_dummies(df, columns=features_category, dtype='int')
df.head()
df_x_data = df.copy()
df_x_data.head()
y_data = df_target
target_classes = y_data.unique()
target_classes = target_classes.astype('category')
test_size_val = 0.3
x_train, x_test, y_train, y_test = train_test_split(df_x_data, y_data, test_size=test_size_val, random_state=1)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
x_train.sort_values(['Comps'], ascending=[True]) #, 'Estimation'
model = tree.DecisionTreeClassifier()
model = model.fit(x_train, y_train)
model.score(x_test, y_test)
dot_data = tree.export_graphviz(model, out_file=None,
                                feature_names=x_train.columns,
                                class_names=target_classes,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
print('graph: ', graph)
colors = ('white','red', 'green')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
print( edges )
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('decision_tree_2019_generic.png')
from IPython.display import Image
Image(filename = 'decision_tree_2019_generic.png')
to_predict = x_test[3:4]
model.predict( to_predict )
to_predict.values
applied = model.apply( to_predict )
applied
to_predict
decision_path = model.decision_path( to_predict )
print( decision_path.indices, '\n' )
print( decision_path[:1][:1])
predict_cols = decision_path.indices
predicted_row = to_predict
cols = predicted_row.columns
#print("len of cols: ", len(cols) )
for col in predict_cols:
    print( cols[col], predicted_row[ cols[col] ].values )
Sample data: it is generated data for now.
Cloud Provider,Comps,env,hosts,OS Upgrade Path,Target_OS_NAME,Target_OS_VERSION,Size,os_version
AWS,11,2,3833,Not Direct,Linux,4,M,2
Google Cloud,16,6,4779,Direct,Mac,3,S,1
AWS,18,6,6677,Not Direct,Linux,7,S,8
Google Cloud,34,2,1650,Direct,Windows,5,B,1
AWS,35,6,9569,Direct,Windows,6,M,3
AWS,36,6,7421,Not Direct,Windows,3,B,5
Google Cloud,49,4,3469,Direct,Mac,6,B,1
AWS,54,5,5677,Direct,Mac,4,M,8
But the decision path for the predicted test record is: Comps[206] -> env[3] -> hosts[637]
Thanks in advance
Answer 0 (score: 3)
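I think you are misunderstanding the return value of decision_path: it returns a sparse matrix that indicates which nodes of the tree the prediction passed through, using the node indices of the tree's internal representation. These indices are not meant to align with the columns of your dataset, and in fact they do not. Instead, if you want to access the features associated with the nodes the prediction passed through, try: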
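predict_nodes = decision_path.indices
predicted_row = to_predict
cols = predicted_row.columns
for node in predict_nodes:
    # map the node index to the index of the feature tested at that node
    col = model.tree_.feature[node]
    if col < 0:
        # leaf nodes test no feature and report a negative index; skip them
        continue
    print( cols[col], predicted_row[ cols[col] ].values )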
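Note that leaf nodes obviously have no test feature, and (in my experience) the feature index returned for them is negative, so watch out for that too.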
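To make that caveat concrete, here is a minimal, self-contained sketch. The toy data and the column names a, b and c are made up for illustration and are not from the question. It compares each visited node with the leaf id returned by apply, so the leaf is skipped before its feature index is looked up:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label depends only on column "b";
# "a" carries no signal and "c" is constant.
feature_names = ["a", "b", "c"]
X = np.array([[0, 1, 5], [1, 2, 5], [0, 8, 5], [1, 9, 5]], dtype=float)
y = np.array([0, 0, 1, 1])

model = DecisionTreeClassifier(random_state=0).fit(X, y)

sample = X[2:3]                               # one row, kept 2-D
node_indicator = model.decision_path(sample)  # sparse matrix: rows = samples, columns = node ids
leaf_id = model.apply(sample)[0]              # id of the leaf this row ends up in

# Node ids visited by the first (and only) row of `sample`.
start, stop = node_indicator.indptr[0], node_indicator.indptr[1]
node_ids = node_indicator.indices[start:stop]

used_features = []
for node_id in node_ids:
    if node_id == leaf_id:
        continue  # the leaf tests no feature; its feature index is negative
    feat = model.tree_.feature[node_id]
    thr = model.tree_.threshold[node_id]
    value = sample[0, feat]
    sign = "<=" if value <= thr else ">"
    used_features.append(feature_names[feat])
    print(f"node {node_id}: {feature_names[feat]} = {value} {sign} {thr:.2f}")
```

With a DataFrame such as to_predict from the question, the same loop should work if feature_names is replaced by the DataFrame's columns and the value is read with .iloc[0, feat].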
To learn more about the tree's internal structure, see this example, and (as the documentation suggests) also use help(sklearn.tree._tree.Tree)
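As a rough illustration of that internal structure, the sketch below walks the parallel arrays on tree_ for a tiny made-up dataset (the data is an assumption for illustration only). It relies on the convention that a leaf has the same value in children_left and children_right:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature dataset that a single split separates.
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

t = clf.tree_
leaf_ids = []
for node_id in range(t.node_count):
    left, right = t.children_left[node_id], t.children_right[node_id]
    if left == right:
        # Leaf: both child pointers hold the same "no child" marker.
        leaf_ids.append(node_id)
        # t.value holds per-class counts or proportions, depending on the scikit-learn version.
        print(f"node {node_id}: leaf, value {t.value[node_id][0]}")
    else:
        print(f"node {node_id}: split on feature {t.feature[node_id]} "
              f"at <= {t.threshold[node_id]:.2f}, children {left} / {right}")
```

The node ids printed here are the same ids that decision_path and apply report, which is why they cannot be used directly as column indices.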