Question

I have a decision tree built with Apache Spark mllib. The output of myDecisionTreeModel.toDebugString(), saved as tree.txt, looks like this:

If (feature 36 <= 0.0)
 If (feature 35 <= 5.0)
  If (feature 42 <= 61.0)
   If (feature 0 <= 3732128.0)
    If (feature 23 <= 2.0)
     Predict: 1.2779154046107128E-4
    Else (feature 23 > 2.0)
     Predict: 3.5523837168253053E-4
...
Else (feature 36 > 0.0)
 If (feature 23 <= 2.0)
  If (feature 41 <= 5.0)
etc.

So I implemented the following Python code to render it with graphviz:

import graphviz
from graphviz import Digraph

dot = Digraph(comment='decision tree')

def parse(from_node, subtree):
    if len(subtree) > 0:
        root = subtree.pop(0).rstrip("\n")
        if "Predict" in root:
            dot.node(from_node, root)
        else:
            dot.node(from_node, root)
            dot.edge(from_node, str(2*int(from_node) + 2))
            dot.edge(from_node, str(2*int(from_node) + 1))
            # split the tree into two halves at the "Else" line
            # that has the same indentation as this "If" line
            idx = 0
            tgt = len(root)-len(root.lstrip())
            for l in subtree:
                if (len(l)-len(l.lstrip()) == tgt):
                    break
                idx += 1
            subtreeRight = subtree[:idx]
            subtreeLeft = subtree[idx+1:]

            parse(str(2*int(from_node) + 1), subtreeRight)
            parse(str(2*int(from_node) + 2), subtreeLeft)


with open("tree.txt", "r") as f:
    lines = f.readlines()
    parse("0", lines)

This may be naive, but I don't see how to do it more efficiently: basically, I use the number of leading spaces to split the tree at a given condition.

The output looks fine, but it is hard to tell whether the left and right children are in the right place. If I swap the two dot.edge(...) calls in parse, the right child seems to end up on the left and the left child on the right. On top of that, the tree is not very well balanced.

Any ideas on how to make it look more like a decision tree, and how to be sure of the child order?

Thanks
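
One possible direction, as a minimal sketch under stated assumptions rather than a tested solution: since the debug string lists each "If" line, then its condition-true subtree, then the matching "Else" line, then the condition-false subtree, the text can be read recursively in that order without counting spaces. The sketch below builds plain DOT text by hand and sets the Graphviz `ordering="out"` graph attribute so that edges are laid out left to right in the order they are written (true branch first). The names `parse_debug_string` and `to_dot`, and the sample prediction values, are hypothetical and only for illustration. It also assumes the full output is available, with no lines elided.

```python
import re


def parse_debug_string(text):
    """Parse If/Else/Predict lines into nested dicts (pre-order, recursive)."""
    # Keep only tree lines; this drops blank lines and any header line.
    lines = [l.strip() for l in text.splitlines()
             if l.strip().startswith(("If", "Else", "Predict"))]
    pos = 0

    def node():
        nonlocal pos
        line = lines[pos]
        pos += 1
        if line.startswith("Predict"):
            return {"label": line}
        # "If (cond)": true subtree, then the matching "Else", then false subtree.
        cond = re.match(r"If \((.*)\)$", line).group(1)
        true_branch = node()
        assert lines[pos].startswith("Else"), "expected the matching Else line"
        pos += 1
        false_branch = node()
        return {"label": cond, "true": true_branch, "false": false_branch}

    return node()


def to_dot(tree):
    """Emit DOT text; the true edge is always written before the false edge."""
    out = ["digraph tree {", '  graph [ordering="out"];', "  node [shape=box];"]
    counter = 0

    def emit(n):
        nonlocal counter
        my_id = counter
        counter += 1
        label = n["label"]
        out.append(f'  n{my_id} [label="{label}"];')
        if "true" in n:
            for key in ("true", "false"):
                child = emit(n[key])
                out.append(f'  n{my_id} -> n{child} [label="{key}"];')
        return my_id

    emit(tree)
    out.append("}")
    return "\n".join(out)


# Hypothetical, shortened sample in the same format as tree.txt.
sample = """\
If (feature 36 <= 0.0)
 If (feature 35 <= 5.0)
  Predict: 1.0
 Else (feature 35 > 5.0)
  Predict: 2.0
Else (feature 36 > 0.0)
 Predict: 3.0
"""

tree = parse_debug_string(sample)
dot_text = to_dot(tree)
print(dot_text)
```

The resulting text could then be rendered with the `dot` command-line tool or, assuming the graphviz Python package is installed, with `graphviz.Source(dot_text)`. Labelling the edges "true" and "false" makes the child order checkable by eye, whichever side the layout puts them on.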

Converting a Spark mllib decision tree to graphviz

0 answers: