Threshold-based clustering

Time: 2015-10-06 21:59:49

Tags: python tree hierarchy

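I'm fairly sure I'm missing the right search terms for this question, so if it has been asked before, please point me to it. I have a tree structure like the following: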

(0)->(0,0:7)
     (0,1:9)
(1)->(1,0:6)
     (1,1:2)
     (1,2:1)

For simplicity, let's convert it to a flat structure:

l1, l2, v1
0, 0, 7
0, 1, 9
1, 0, 6
1, 1, 2
1, 2, 1

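Now let's apply a threshold of 3 to this tree. This means we want to keep the nodes whose value is at or above the threshold and merge all the nodes within a branch whose value is below it.

So we end up with the last two rows (both below the threshold) replaced by a single node whose value is the sum of the two: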

l1, l2, v1
0, 0, 7
0, 1, 9
1, 0, 6
1, (1,2),  3
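
The merge described above can be sketched directly on the flat rows. This is only an illustration of the desired result, not the approach taken in the answer below; the function name `merge_below_threshold` and the tuple-based row format are assumptions made for this example, and a branch with a single small row is simply kept as a one-element group.

```python
from collections import defaultdict

def merge_below_threshold(rows, thresh=3):
    """Merge rows of the same branch (same l1) whose value is below thresh.

    rows is a list of (l1, l2, v1) tuples; the format is an assumption.
    """
    kept = []
    small = defaultdict(list)  # l1 -> [(l2, v1), ...] for rows below thresh
    for l1, l2, v1 in rows:
        if v1 < thresh:
            small[l1].append((l2, v1))
        else:
            kept.append((l1, l2, v1))
    # One merged row per branch: labels are collected, values are summed
    for l1, items in small.items():
        labels = tuple(l2 for l2, _ in items)
        total = sum(v for _, v in items)
        kept.append((l1, labels, total))
    # Stable sort by branch, so a merged row lands at the end of its branch
    return sorted(kept, key=lambda row: row[0])

rows = [(0, 0, 7), (0, 1, 9), (1, 0, 6), (1, 1, 2), (1, 2, 1)]
print(merge_below_threshold(rows, thresh=3))
```

With the sample data, the two rows of branch 1 with values 2 and 1 are combined into one row labelled (1, 2) with value 3.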

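Ideally, a solution in Python would be appreciated. Obviously, I'm happy to handle the edge conditions myself. Note that in practice I can end up with trees up to 6 levels deep.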

1 Answer:

Answer 0: (score: 0)

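So I ended up doing it in the tedious way I hinted at earlier. I started by adding a couple of flags to the node definition (dodelete=False, visited=False),

and updated the add_child method to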

def add_child(self, node):
    node.parent = self
    node.level = self.level + 1
    self.children.append(node)
    return node

where self.children is a list of nodes.

Then come two methods:
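
The node definition itself is not shown in the answer. A minimal sketch of what it might look like follows, assuming the constructor signature implied by the calls in the snippets (`Node(name, val=..., id=..., visited=...)`); the defaults for `parent`, `level`, and `children`, and the use of tuples as ids, are guesses made for illustration.

```python
class Node:
    """Hypothetical node class; field names follow the answer's snippets."""

    def __init__(self, name, val=None, id=None, dodelete=False, visited=False):
        self.name = name
        self.val = val
        self.id = id
        self.dodelete = dodelete  # set once the node has been merged away
        self.visited = visited    # set on nodes created by a roll-up
        self.parent = None        # assumed default for a root node
        self.level = 0            # assumed: the root sits at level 0
        self.children = []

    def add_child(self, node):
        node.parent = self
        node.level = self.level + 1
        self.children.append(node)
        return node

# Build a two-level branch; ids as tuples are an assumption
root = Node("root")
branch = root.add_child(Node("1", id=(1,)))
leaf = branch.add_child(Node("1,2", val=1, id=(1, 2)))
print(leaf.level, leaf.parent.name)
```

Because add_child returns the node it was given, calls can be chained while building a branch.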

def collapse_nodes(tree, thresh=3):
    for n in tree:
        if n.dodelete:
            continue
        sub_tree = tree.get_by_path(n.id)
        sub_tree_stack = []
        # Collect the children whose value is below the threshold
        for child in sub_tree.children:
            if child.val is not None and thresh > child.val:
                sub_tree_stack.append(child)
                # Mark the original node for deletion
                tree.get_by_path(child.id).dodelete = True
        # Only after all children are checked, add one merged node
        if sub_tree_stack:
            sub_tree.add_child(Node(",".join([subnode.name for subnode in sub_tree_stack]),
                                    val=sum([subnode.val for subnode in sub_tree_stack]),
                                    id=sorted([subnode.id for subnode in sub_tree_stack])[0]))
    return tree

def roll_up(tree, thresh=2, level=5):
    for n in tree:
        if n.dodelete or n.visited:
            continue
        if n.level != level:
            continue
        sub_tree = tree.get_by_path(n.id)
        if sub_tree is None:
            continue
        sub_tree_stack = []
        for child in sub_tree.children:
            if child.val is not None and child.val <= thresh:
                sub_tree_stack.append((child.name, child.id, child.val))
                # Also mark this for deletion
                tree.get_by_path(child.id).dodelete = True
        if sub_tree_stack:
            # Build the rolled-up node from the collected children
            node_name = n.name + ": [" + ",".join([subnode[0] for subnode in sub_tree_stack]) + "]"
            node_val = sum([subnode[2] for subnode in sub_tree_stack])
            # Get the parent for these nodes
            parent_id = n.parent.id
            # Now ensure that you delete the old node before adding new
            sub_tree.dodelete = True

            tree.get_by_path(parent_id).add_child(Node(node_name, val=node_val, id=n.id, visited=True))
    return tree
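
Both functions rely on a tree object that can be iterated and that offers get_by_path, neither of which is shown in the answer. A rough sketch of such a wrapper is given here purely as an assumption: the class name `Tree`, the pre-order traversal, the lookup by matching `id`, and the `make_node` helper are all hypothetical. Iterating over a snapshot is a deliberate choice in this sketch, so that nodes added by add_child during a pass are not visited in that same pass.

```python
from types import SimpleNamespace

class Tree:
    """Hypothetical wrapper providing iteration and get_by_path."""

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        # Collect nodes in pre-order first, then iterate over that snapshot
        nodes = []
        stack = [self.root]
        while stack:
            node = stack.pop()
            nodes.append(node)
            stack.extend(reversed(node.children))
        return iter(nodes)

    def get_by_path(self, path):
        # Return the first node whose id equals path, or None if absent
        for node in self:
            if node.id == path:
                return node
        return None

def make_node(id, val=None):
    # Stand-in for the answer's Node class, with only the fields used here
    return SimpleNamespace(id=id, val=val, children=[])

root = make_node(())
a = make_node((1,))
b = make_node((1, 2), val=1)
root.children.append(a)
a.children.append(b)
tree = Tree(root)
print([n.id for n in tree])
```

Since a merged node reuses the id of one of the nodes it replaces, a lookup by id alone can be ambiguous; this sketch simply returns the first match in traversal order.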

A somewhat convoluted way of doing it, but it works. I've created a gist for any adventurous soul who wants to test it: https://gist.github.com/fahaddaniyal/0dc86c80f266fd9f8cdb