
Question

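I am trying to implement this algorithm in Python, but because of my limited knowledge of tree structures, I am confused about how the partition tree is created.

Brief explanation:

The linked algorithm partitions a high-dimensional feature space into internal and leaf nodes so that queries can be executed quickly.

It divides a large space using a specific random test: a hyperplane that splits one large cell into two.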
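
As a rough illustration of the random test described above, here is a minimal sketch of a single hyperplane split. It is not taken from the linked algorithm: the function name split_cell, the Gaussian random direction, and the use of np.percentile for the threshold are all assumptions made for this example.

```python
import numpy as np


def split_cell(points, ratio=0.5, rng=None):
    """Split an (n, d) array of points into two cells with one random hyperplane.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random normal vector of the hyperplane (assumed Gaussian here).
    direction = rng.standard_normal(points.shape[1])
    # Project every point onto that direction: one scalar per point.
    projections = points @ direction
    # Offset of the hyperplane, placed at the requested percentile.
    threshold = np.percentile(projections, 100 * ratio)
    below = projections < threshold
    return points[below], points[~below], direction, threshold
```

With ratio=0.5 and distinct projections, each side receives about half of the points, which is what makes one large cell become two smaller ones.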

This answer explains everything much more precisely.

(Taken from the link above)

Code snippet:

def random_test(self, main_point):  # Main point is np.ndarray instance
    dimension = main_point.ravel().size
    random_coefficients = self.random_coefficients(dimension)
    scale_values = np.array(sorted([np.inner(random_coefficients, point.ravel())
                                    for point in self.points]))
    percentile = random.choice([np.percentile(scale_values, 100 * self.ratio),  # Just as described on Section 3.1
                                np.percentile(scale_values, 100 * (1 - self.ratio))])
    main_term = np.inner(main_point.ravel(), random_coefficients)
    if self.is_leaf():
        return 0  # Next node is the center leaf child
    else:
        if (main_term - percentile) >= 0:  # Hyper-plane equation defined in the document
            return -1  # Next node is the left child
        else:
            return 1  # Next node is the right child

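As described in the algorithm linked above, self.ratio determines how balanced and shallow the tree is; 1/2 should produce the most balanced and shallowest tree.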
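
To see why 1/2 gives the shallowest tree, assume the worst case in which every split sends a fraction (1 - ratio) of the points to the same child. The helper below counts how many such splits are needed before a cell fits within the capacity. The name worst_case_depth and the worst-case assumption are mine, not the paper's.

```python
def worst_case_depth(n_points, capacity, ratio):
    """Number of splits needed if the larger child always keeps (1 - ratio) of the points.

    Hypothetical helper for illustration only.
    """
    if not 0 < ratio <= 0.5:
        raise ValueError("ratio is expected to lie in (0, 0.5]")
    depth = 0
    largest = float(n_points)
    while largest > capacity:
        largest *= 1 - ratio  # size of the larger child after one more split
        depth += 1
    return depth
```

For the same number of points and the same capacity, a ratio below 1/2 yields a larger bound than a ratio of exactly 1/2, which matches the statement that 1/2 produces the shallowest tree.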

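Then comes the iterative part, in which the tree keeps dividing the space until it reaches the leaf nodes (note the keyword reaches). The problem is that it never actually reaches a leaf node.

So, the definition of a leaf node in the document linked above is this: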

def is_leaf(self):
    return (self.capacity * self.ratio) <= self.cell_count() <= self.capacity

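where self.cell_count() is the number of points in the cell, self.capacity is the maximum number of points a cell can hold, and self.ratio is the split ratio.

My full code should basically divide the feature space by creating new nodes, starting at the initial iteration, until the leaf nodes are created (but the leaf nodes are never created). See the fragment that contains the division process.

(Taken from the document linked above)

tl;dr:

Are binary partition trees prepared in advance (filled with empty nodes) before we add any points to them? If so, don't we need to define the level (depth) of the tree?

If not, are binary partition trees created while points are being added to them? If so, how is the first point (from the first iteration) added to the tree?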

Answer 1

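They are built as you go. The first node is to the right or left of line 1. Then the next level is to the right or left of line 2... The illustration from the paper you provided shows this, with the lines numbered according to the choices presented to find the node.

Of course, right or left is not strictly accurate. Some lines cut horizontally. But the spaces created are binary.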

Answer 2

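I have been able to test the new approach mentioned in the comments, and it works well.

The algorithm that was linked above implicitly states that each point should be dropped into the partition tree individually, passing through all the random tests and creating new nodes as it falls.

But this approach has a major problem: to get a balanced, efficient, and shallow tree, the points must be distributed evenly between the left and right nodes.

Therefore, to split a node, at every level of the tree each point of the node must be passed to either the left or the right child (via the random test), until the tree reaches a depth at which every node is a leaf.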
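
The idea of splitting all of a node's points at once can be sketched recursively. This is only a sketch under assumptions, not the code from this answer: build_tree and leaves are made-up names, nodes are plain dictionaries, the random direction is Gaussian, and a cell is treated as a leaf as soon as it holds at most capacity points (a simpler rule than the is_leaf window in the question).

```python
import numpy as np


def build_tree(points, capacity, ratio=0.5, rng=None):
    """Recursively split an (n, d) array until every cell holds <= capacity points."""
    rng = np.random.default_rng() if rng is None else rng
    if len(points) <= capacity:
        return {"leaf": True, "points": points}
    direction = rng.standard_normal(points.shape[1])
    projections = points @ direction
    threshold = np.percentile(projections, 100 * ratio)
    below = projections < threshold
    if not below.any():
        # Degenerate split (for example, all points identical): stop here.
        return {"leaf": True, "points": points}
    return {
        "leaf": False,
        "direction": direction,
        "threshold": threshold,
        "left": build_tree(points[below], capacity, ratio, rng),
        "right": build_tree(points[~below], capacity, ratio, rng),
    }


def leaves(node):
    """Collect the point arrays stored in all leaves of the tree."""
    if node["leaf"]:
        return [node["points"]]
    return leaves(node["left"]) + leaves(node["right"])
```

Because the threshold is computed from all points of the node before any of them is routed, both children receive a controlled share of the points, which is the property the one-point-at-a-time approach lacks.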

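In mathematical terms, the root node contains a vector space that is divided into a left and a right node, each containing a convex polytope bounded by the supporting hyperplanes of the separating hyperplane: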
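
The equation itself did not survive in this copy of the post. Judging only from the random_test code shown in this post (main_term - percentile >= 0), the test presumably has the following form, where a is the vector of random coefficients, x is the point being tested, d is the dimension, and b is the percentile threshold. The symbol names are my own choice.

```latex
\sum_{i=1}^{d} a_i x_i - b \ge 0
```

In the code, points that satisfy the inequality go to the left child (return value -1) and all other points go to the right child (return value 1).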

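The negative term of the equation (I believe we can call it the bias) is where the split ratio comes into play: it should be a percentile of all the node's points between 100*r and 100*(1-r), so that the tree is split more evenly and ends up shallower. Basically, it decides how even the hyperplane separation is, which is why we need nodes that already contain all of their points.

I have been able to implement such a system: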

def index_space(self):
    shuffled_space = self.shuffle_space()
    current_tree = PartitionTree()
    level = 0
    root_node = RootNode(shuffled_space, self.capacity, self.split_ratio, self.indices)
    current_tree.root_node = root_node
    current_tree.node_array.append(root_node)
    current_position = root_node.node_position
    node_array = {0: [root_node]}
    while True:
        current_nodes = node_array[level]
        if all([node.is_leaf() for node in current_nodes]):
            break
        else:
            level += 1
            node_array[level] = []
            for current_node in current_nodes:
                if not current_node.is_leaf():
                    left_child = InternalNode(self.capacity, self.split_ratio, self.indices,
                                              self._append(current_position, [-1]), current_node)
                    right_child = InternalNode(self.capacity, self.split_ratio, self.indices,
                                               self._append(current_position, [1]), current_node)
                    for point in current_node.list_points():
                        if current_node.random_test(point) == 1:
                            right_child.add_point(point)
                        else:
                            left_child.add_point(point)
                    node_array[level].extend([left_child, right_child])

where node_array contains all the nodes of the tree (root, internal and leaf).

Unfortunately, the node.random_test(x) method:

def random_test(self, main_point):
    if self.is_leaf():
        return 0  # Next node is the center leaf child
    # Draw the coefficients once and reuse them for every projection
    random_coefficients = self.random_coefficients()
    scale_values = [np.inner(random_coefficients, point[:self.indices].ravel())
                    for point in self.points]
    percentile = np.percentile(scale_values, self.ratio * 100)
    main_term = np.inner(main_point[:self.indices].ravel(), random_coefficients)
    if (main_term - percentile) >= 0:  # Hyper-plane equation defined in the document
        return -1  # Next node is the left child
    return 1  # Next node is the right child

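is inefficient, because computing the percentile takes too much time. So I have to find another way to compute the percentile (perhaps by performing a short-circuited binary search to optimize it).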
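
One possible way to reduce the cost, offered as a sketch rather than a tested fix for the code above: as written, random_test recomputes the projections of all the node's points and their percentile every time it is called, so routing n points costs on the order of n percentile computations per node. Computing the direction and threshold once per node and reusing them avoids that. The class name CachedSplit and its methods are hypothetical, and the Gaussian direction is an assumption.

```python
import numpy as np


class CachedSplit:
    """Hypothetical helper: draws the hyperplane once and reuses it for every point."""

    def __init__(self, points, ratio, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.direction = rng.standard_normal(points.shape[1])
        # The percentile is computed a single time, when the node is created.
        self.threshold = np.percentile(points @ self.direction, 100 * ratio)

    def side(self, point):
        """Route one point; same convention as the post: -1 is left, 1 is right."""
        return -1 if point @ self.direction - self.threshold >= 0 else 1

    def sides(self, points):
        """Route a whole (n, d) array of points in one vectorized call."""
        return np.where(points @ self.direction - self.threshold >= 0, -1, 1)
```

Caching also makes the test deterministic for a given node, so the same point is always sent to the same child.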

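Conclusion:

This is just a large extension of Clinton Ray Mulligan's answer, which briefly explains the solution for creating such trees, and it will therefore remain the accepted answer.

I have just added more details in case anyone is interested in implementing randomized binary partition trees.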

Are binary partition trees pre-prepared before adding points to nodes?
