I have a decision tree built with Apache Spark MLlib. The output of myDecisionTreeModel.toDebugString(), saved as tree.txt, looks like this:
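
If (feature 36 <= 0.0)
 If (feature 35 <= 5.0)
  If (feature 42 <= 61.0)
   If (feature 0 <= 3732128.0)
    If (feature 23 <= 2.0)
     Predict: 1.2779154046107128E-4
    Else (feature 23 > 2.0)
     Predict: 3.5523837168253053E-4
 ...
Else (feature 36 > 0.0)
 If (feature 23 <= 2.0)
  If (feature 41 <= 5.0)
  etc.

So I implemented the Python code below to render it with graphviz. This may be naive, but I don't know how to do it more efficiently: basically, I split the tree at a given condition by counting the number of leading spaces on each line.

The output looks fine, but it is hard to tell whether the left and right children end up in the correct positions. If I swap the two dot.edge lines in this code:
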
import graphviz
from graphviz import Digraph

dot = Digraph(comment='decision tree')

def parse(from_node, subtree):
    if len(subtree) > 0:
        root = subtree.pop(0).rstrip("\n")
        if "Predict" in root:
            dot.node(from_node, root)
        else:
            dot.node(from_node, root)
            dot.edge(from_node, str(2*int(from_node) + 2))
            dot.edge(from_node, str(2*int(from_node) + 1))
            # split the tree into two halves
            idx = 0
            tgt = len(root)-len(root.lstrip())
            for l in subtree:
                if (len(l)-len(l.lstrip()) == tgt):
                    break
                idx += 1
            subtreeRight = subtree[:idx]
            subtreeLeft = subtree[idx+1:]
            parse(str(2*int(from_node) + 1), subtreeRight)
            parse(str(2*int(from_node) + 2), subtreeLeft)

with open("tree.txt", "r") as f:
    lines = f.readlines()
parse("0", lines)

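it seems that the right child ends up on the left and the left child on the right. On top of that, the tree is not very well balanced.

Any ideas on how to make it look more like a decision tree, and how to be sure of the order of the children?

Thanks.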
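
One direction that might make the order easier to verify is to label every edge with the branch it represents instead of relying on node numbering. The following is only a rough sketch under stated assumptions: it assumes each nesting level in tree.txt is indented more than its parent, and that every Else line comes after the whole subtree of its matching If line. The names indent_of and debug_string_to_dot are hypothetical helpers, and the output is plain DOT text rather than a Digraph object, so it uses only the standard library.

```python
import re

def indent_of(line):
    """Count the leading whitespace characters of a line."""
    return len(line) - len(line.lstrip())

def debug_string_to_dot(lines):
    """Turn toDebugString-style lines into DOT text with True/False edge labels."""
    dot = [
        "digraph tree {",
        "  ordering=out;",  # ask dot to draw out-edges left to right in definition order
        "  node [shape=box];",
    ]
    stack = []   # open If nodes, each stored as [indent, node_id, current_branch]
    next_id = 0
    for raw in lines:
        text = raw.strip()
        indent = indent_of(raw)
        if text.startswith("Else"):
            # Close everything nested under the matching If, then switch its branch.
            while stack and stack[-1][0] > indent:
                stack.pop()
            if stack and stack[-1][0] == indent:
                stack[-1][2] = "False"
            continue
        cond = re.match(r"If \((.*)\)$", text)
        if not (cond or text.startswith("Predict")):
            continue  # skip headers, blank lines, "..." and similar
        # Anything at the same or deeper indent is finished by now.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        node_id = next_id
        next_id += 1
        label = cond.group(1) if cond else text
        dot.append('  n%d [label="%s"];' % (node_id, label))
        if stack:
            parent = stack[-1]
            dot.append('  n%d -> n%d [label="%s"];' % (parent[1], node_id, parent[2]))
        if cond:
            stack.append([indent, node_id, "True"])
    dot.append("}")
    return "\n".join(dot)

# Small made-up tree in the same textual shape as tree.txt (not real model output).
sample = [
    "If (feature 0 <= 1.0)\n",
    " Predict: 0.0\n",
    "Else (feature 0 > 1.0)\n",
    " If (feature 1 <= 2.0)\n",
    "  Predict: 1.0\n",
    " Else (feature 1 > 2.0)\n",
    "  Predict: 2.0\n",
]
dot_text = debug_string_to_dot(sample)
print(dot_text)
```

Because the True edge of each condition is always emitted before its False edge, and ordering=out asks the dot layout to keep that order, the If branch should be drawn on the left. The resulting text could then be rendered with something like graphviz.Source(dot_text).render("tree"), assuming the graphviz Python package and the Graphviz binaries are installed.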