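Is there a way to compute the distance between two leaves in a decision tree?
By distance, I mean the number of nodes you pass through to get from one leaf to the other.
For example, in this sample diagram: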
distance(leaf1, leaf2) == 1
distance(leaf1, leaf3) == 3
distance(leaf1, leaf4) == 4
Thanks for your help!
Answer 0 (score: 5)
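Here is an example that relies on additional Python packages, namely networkx and pydot, so the solution is commented generously. The question is tagged scikit-learn, so the solution is written in Python.
Some data and a generic DecisionTreeClassifier: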
# load example data and classifier
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# for determining distance
from sklearn import tree
import networkx as nx
import pydot
# load data and fit a DecisionTreeClassifier
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train);
This function uses tree.export_graphviz, pydot.graph_from_dot_data, nx.drawing.nx_pydot.from_pydot and to_undirected to convert a fitted DecisionTreeClassifier into an undirected networkx MultiGraph.
def dt_to_mg(clf):
    """Convert a fitted DecisionTreeClassifier to an undirected networkx MultiGraph."""
    # export the classifier to a string in DOT format
    dot_data = tree.export_graphviz(clf)
    # use pydot to convert the DOT data to a graph
    dot_graph = pydot.graph_from_dot_data(dot_data)[0]
    # import the graph data into networkx
    MG = nx.drawing.nx_pydot.from_pydot(dot_graph)
    # convert the tree to an undirected networkx graph
    uMG = MG.to_undirected()
    return uMG
uMG = dt_to_mg(clf)
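As an aside, the same edge counts can be obtained without networkx or pydot by walking parent links. The following is only a minimal sketch: it assumes the two child arrays follow the layout of scikit-learn's clf.tree_.children_left and clf.tree_.children_right (where -1 marks a leaf), the helper names parent_map, path_to_root and edges_between are hypothetical, and the toy arrays are made up rather than taken from the wine tree.

```python
def parent_map(children_left, children_right):
    """Map each child node id to its parent id (the root has no entry)."""
    parents = {}
    for node, (left, right) in enumerate(zip(children_left, children_right)):
        if left != -1:  # -1 marks a leaf in scikit-learn's tree arrays
            parents[left] = node
            parents[right] = node
    return parents


def path_to_root(node, parents):
    """Return the node ids from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


def edges_between(a, b, parents):
    """Number of edges on the unique tree path between nodes a and b."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parents))}
    for steps_from_b, n in enumerate(path_to_root(b, parents)):
        if n in steps_from_a:  # first shared ancestor is the lowest common one
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


# Toy tree (made up for illustration):
#        0
#      /   \
#     1     4
#    / \
#   2   3
children_left = [1, 2, -1, -1, -1]
children_right = [4, 3, -1, -1, -1]
parents = parent_map(children_left, children_right)

print(edges_between(2, 3, parents))      # sibling leaves: 2 edges
print(edges_between(2, 4, parents) - 1)  # nodes in between: 3 edges - 1 = 2
```

With a fitted classifier, the two arrays would come from clf.tree_, and the node ids are the integers returned by clf.apply rather than strings.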
Use nx.shortest_path_length to find the distance between any two nodes in the tree.
# get leaves
leaves = set(str(x) for x in clf.apply(X))
print(leaves)
{'10', '7', '9', '5', '3', '4'}
# find the distance for two leaves
print(nx.shortest_path_length(uMG, source='9', target='5'))
5
# undirected graph means this should also work
print(nx.shortest_path_length(uMG, source='5', target='9'))
5
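shortest_path_length returns the number of edges between source and target. That is not the distance metric the OP asked for. I believe the number of nodes between them is simply n_edges - 1.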
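To see why subtracting one works, it helps to look at an actual path: a path with k edges visits k + 1 nodes, two of which are the endpoints. The sketch below uses a plain breadth-first search on a made-up adjacency dict (not the wine tree); the function name shortest_path here is a hypothetical local helper, not the networkx function.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Breadth-first search; returns the node sequence from source to target."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adj[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if target not in previous:
        raise ValueError("no path between source and target")
    # walk the predecessor links back from target to source
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


# Made-up undirected tree: root "0" has a leaf "1" and an internal node "2",
# which in turn has the leaves "3" and "4".
adj = {
    "0": ["1", "2"],
    "1": ["0"],
    "2": ["0", "3", "4"],
    "3": ["2"],
    "4": ["2"],
}

path = shortest_path(adj, "1", "4")
n_edges = len(path) - 1    # what an edge-count path length reports
n_between = len(path) - 2  # nodes strictly between the two leaves
print(path, n_edges, n_between)
```

Here the path is 1, 0, 2, 4, so there are three edges and two nodes in between, which matches n_edges - 1.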
print(nx.shortest_path_length(uMG, source='5', target='9') - 1)
4
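Alternatively, find the distances between all pairs of leaves and store them in a dictionary or another convenient object for downstream computation.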
from itertools import combinations
leaf_distance_edges = {}
leaf_distance_nodes = {}
for leaf1, leaf2 in combinations(leaves, 2):
    d = nx.shortest_path_length(uMG, source=leaf1, target=leaf2)
    leaf_distance_edges[(leaf1, leaf2)] = d
    leaf_distance_nodes[(leaf1, leaf2)] = d - 1
leaf_distance_nodes
{('4', '9'): 5,
('4', '5'): 2,
('4', '10'): 5,
('4', '7'): 4,
('4', '3'): 1,
('9', '5'): 4,
('9', '10'): 1,
('9', '7'): 2,
('9', '3'): 5,
('5', '10'): 4,
('5', '7'): 3,
('5', '3'): 2,
('10', '7'): 2,
('10', '3'): 5,
('7', '3'): 4}
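One thing to watch with this dictionary: the keys are ordered tuples, and because leaves is a set, the order within each pair is arbitrary, so a lookup with the pair reversed may raise a KeyError. A possible workaround, sketched here with three entries copied from the output above, is to key on a frozenset instead; the name node_distance is a hypothetical helper.

```python
# three entries copied from the output above
leaf_distance_nodes = {("4", "9"): 5, ("4", "5"): 2, ("9", "5"): 4}

# re-key on frozensets so that the order of the two leaves no longer matters
symmetric = {frozenset(pair): d for pair, d in leaf_distance_nodes.items()}


def node_distance(leaf_a, leaf_b):
    """Look up the number of nodes between two distinct leaves, in either order."""
    return symmetric[frozenset((leaf_a, leaf_b))]


print(node_distance("9", "4"))  # same entry as ("4", "9")
print(node_distance("5", "9"))  # same entry as ("9", "5")
```

This sketch deliberately leaves the distance from a leaf to itself undefined, since the dictionary only stores pairs of distinct leaves.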