Distance between two leaves in a DecisionTreeClassifier

Time: 2018-12-04 17:41:31

Tags: python scikit-learn random-forest

Is there a way to compute the distance between two leaves in a decision tree?

By distance, I mean the number of nodes on the path from one leaf to the other.

[Figure: example decision tree graph]

For example, in this example graph:

distance(leaf1, leaf2) == 1
distance(leaf1, leaf3) == 3
distance(leaf1, leaf4) == 4

Thanks for your help!

1 Answer:

Answer 0 (score: 5)

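Here is an example that relies on additional Python packages (namely networkx and pydot), so the solution is generously commented. The question is tagged scikit-learn, so the solution is presented in Python.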

Some data and a generic DecisionTreeClassifier:

# load example data and classifier
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# for determining distance
from sklearn import tree
import networkx as nx
import pydot

# load data and fit a DecisionTreeClassifier
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

This function uses tree.export_graphviz, pydot.graph_from_dot_data, nx.drawing.nx_pydot.from_pydot, and the graph's to_undirected method to convert a fitted DecisionTreeClassifier into an undirected networkx MultiGraph:

def dt_to_mg(clf):
    """Convert a fitted DecisionTreeClassifier to an undirected networkx MultiGraph."""
    # export the classifier to a string DOT format
    dot_data = tree.export_graphviz(clf)
    # Use pydot to convert the dot data to a graph
    dot_graph = pydot.graph_from_dot_data(dot_data)[0]
    # Import the graph data into Networkx 
    MG = nx.drawing.nx_pydot.from_pydot(dot_graph)
    # Convert the tree to an undirected Networkx Graph
    uMG = MG.to_undirected()
    return uMG

uMG = dt_to_mg(clf)
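
If pydot is not available, the same undirected structure can probably be read straight from the fitted tree. The following is a minimal sketch, assuming arrays laid out like scikit-learn's clf.tree_.children_left and clf.tree_.children_right, where -1 marks a leaf. The arrays below are hand-written stand-ins rather than output from the classifier above, and tree_to_adjacency is a hypothetical helper name.

```python
def tree_to_adjacency(children_left, children_right):
    """Build an undirected adjacency dict from parallel child-index arrays."""
    adjacency = {node: set() for node in range(len(children_left))}
    for node, (left, right) in enumerate(zip(children_left, children_right)):
        for child in (left, right):
            if child != -1:  # -1 means "no child", i.e. this node is a leaf
                adjacency[node].add(int(child))
                adjacency[int(child)].add(node)
    return adjacency


# Hand-written stand-in for clf.tree_.children_left / clf.tree_.children_right:
# node 0 is the root, node 1 is an internal node, nodes 2, 3 and 4 are leaves.
children_left = [1, 2, -1, -1, -1]
children_right = [4, 3, -1, -1, -1]

adjacency = tree_to_adjacency(children_left, children_right)

# Leaves are the nodes without children
leaves = [node for node, left in enumerate(children_left) if left == -1]

print(adjacency)
print(leaves)
```

With this route the node ids are integers, whereas the pydot route above produces string node names such as '9'.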

Use nx.shortest_path_length to find the distance between any two nodes in the tree:

# get leaves
leaves = set(str(x) for x in clf.apply(X))
print(leaves)
{'10', '7', '9', '5', '3', '4'}

# find the distance for two leaves
print(nx.shortest_path_length(uMG, source='9', target='5'))
5

# undirected graph means this should also work
print(nx.shortest_path_length(uMG, source='5', target='9'))
5
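
For an unweighted graph, the shortest path length is the number of edges on the shortest route, which a breadth-first search finds. Below is a plain-Python sketch of that idea on a small hand-written adjacency dict. The helper name bfs_edge_count and the toy graph are assumptions for illustration, not networkx's implementation.

```python
from collections import deque


def bfs_edge_count(adjacency, source, target):
    """Return the number of edges on a shortest path from source to target."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    raise ValueError(f"no path between {source!r} and {target!r}")


# Toy tree: '0' is the root, '2', '3' and '4' are leaves
toy = {
    '0': ['1', '4'],
    '1': ['0', '2', '3'],
    '2': ['1'],
    '3': ['1'],
    '4': ['0'],
}

print(bfs_edge_count(toy, '2', '3'))  # sibling leaves
print(bfs_edge_count(toy, '2', '4'))  # path goes through the root
print(bfs_edge_count(toy, '4', '2'))  # same pair, other direction
```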

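shortest_path_length returns the number of edges between source and target. That is not the distance metric the OP asked for. I believe the number of nodes between them is simply n_edges - 1: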

print(nx.shortest_path_length(uMG, source='5', target='9') - 1)
4
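
A path with n edges visits n + 1 nodes, and two of those are the leaves themselves, which is where n_edges - 1 comes from. The sketch below lists the intermediate nodes explicitly. The tree shape is hypothetical: the figure from the question is not available, so the parent map was chosen only so that it reproduces the three distances the question states. The helper names are likewise made up for illustration.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def nodes_between(a, b, parent):
    """Return the nodes strictly between a and b on the tree path joining them."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # Drop the shared ancestors above the lowest common ancestor
    while len(up_a) > 1 and len(up_b) > 1 and up_a[-2] == up_b[-2]:
        up_a.pop()
        up_b.pop()
    # Both lists now end at the lowest common ancestor; join them there
    full_path = up_a + up_b[-2::-1]
    return full_path[1:-1]


# Hypothetical tree (child -> parent), assumed to match the question's numbers
parent = {
    'A': 'root', 'B': 'root',
    'leaf1': 'A', 'leaf2': 'A',
    'leaf3': 'B', 'C': 'B',
    'leaf4': 'C', 'leaf5': 'C',
}

for other in ('leaf2', 'leaf3', 'leaf4'):
    between = nodes_between('leaf1', other, parent)
    print(other, between, len(between))
```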

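Alternatively, find the distances for all pairs of leaves and store them in a dictionary or another convenient object for downstream computation: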

from itertools import combinations
leaf_distance_edges = {}
leaf_distance_nodes = {}
for leaf1, leaf2 in combinations(leaves, 2):
    d = nx.shortest_path_length(uMG, source=leaf1, target=leaf2)
    leaf_distance_edges[(leaf1, leaf2)] = d
    leaf_distance_nodes[(leaf1, leaf2)] = d - 1 

leaf_distance_nodes
{('4', '9'): 5,
 ('4', '5'): 2,
 ('4', '10'): 5,
 ('4', '7'): 4,
 ('4', '3'): 1,
 ('9', '5'): 4,
 ('9', '10'): 1,
 ('9', '7'): 2,
 ('9', '3'): 5,
 ('5', '10'): 4,
 ('5', '7'): 3,
 ('5', '3'): 2,
 ('10', '7'): 2,
 ('10', '3'): 5,
 ('7', '3'): 4}
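
Because combinations yields each pair only once, the dictionary holds a single ordering per pair, so leaf_distance_nodes[('9', '4')] would raise a KeyError even though ('4', '9') is present. One possible way around that, sketched on a few entries copied from the output above, is to key on frozenset instead of tuple. The helper name lookup_distance is hypothetical.

```python
# A few entries copied from the output above
leaf_distance_nodes = {
    ('4', '9'): 5,
    ('4', '3'): 1,
    ('9', '10'): 1,
}

# Re-key on frozenset so the order of the two leaves does not matter
symmetric = {frozenset(pair): d for pair, d in leaf_distance_nodes.items()}


def lookup_distance(leaf_a, leaf_b):
    """Return the stored distance for two distinct leaves, in either order."""
    return symmetric[frozenset((leaf_a, leaf_b))]


print(lookup_distance('9', '4'))
print(lookup_distance('4', '9'))
```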