indexing - Adding PH-Tree to ELKI

I'm considering adding the PH-tree to ELKI. I couldn't find any tutorials or examples for that, and the internal architecture is not fully obvious to me at the moment.

Do you think it makes sense to add the PH-tree to ELKI?
How much effort would that be?
Could I get some help?
Does it make sense to implement only an in-memory version, as done for the kd-tree (as far as I understand)?

Some context: The PH-tree is a spatial index that was published at SIGMOD'14: paper, Java source code is available here. It is somewhat similar to a quadtree, but much more space-efficient, does not require rebalancing, and scales quite well with dimensionality. What makes the PH-tree different from the R*-Tree implementations is that there is no concept of leaf/inner nodes, and nodes do not map directly to pages. It also works quite well with random insert/delete (no bulk-loading required).

Yes.

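Of course it would be nice to have the PH-tree in ELKI, so that others can experiment with it. We want ELKI to be a comprehensive tool; it has R-trees, M-trees, k-d-trees, cover trees, LSH, iDistance, inverted lists, space-filling curves, PINN, ...; and there are working but not yet cleaned-up implementations of X-tree, rank-cover-trees, bond, and others.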

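We want researchers to be able to easily study which index works best for their data, and having the PH-tree would of course be nice for that, too. We also try to push the limits of these indexes, e.g. by supporting distance measures other than Euclidean distance.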

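The effort depends on how experienced you are with coding. ELKI uses some well-optimized data structures, which means that for performance reasons we do not use the standard Java API in many places. For example, adding the cover tree took me about one day of work (and it performs quite well). I assume a more flexible (but also more memory-intensive) k-d-tree would be a similar amount of work. I have not studied the PH-tree in detail, but I expect it to be more effort than that. My gut also says that it will not be as fast as advertised. It appears to be a prefix-compressed quadtree. In my experiments, bit-interleaving approaches, as needed for e.g. Hilbert curves, can be quite expensive. It probably also only works with Minkowski metrics. But you are welcome to prove me wrong. ;-)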
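To make the bit-interleaving point concrete, here is a minimal sketch of Z-order (Morton) interleaving, which is the simpler relative of what a Hilbert curve needs. It is not taken from ELKI or from the PH-tree sources; the class name `MortonSketch` and the method `interleave` are made up for illustration, and it assumes non-negative integer coordinates whose interleaved bits fit into a single `long`.

```java
public final class MortonSketch {
  /**
   * Interleave the lowest `bits` bits of each coordinate, most significant
   * bit first. Hypothetical helper for illustration only.
   */
  static long interleave(int[] coords, int bits) {
    if ((long) coords.length * bits > 64) {
      throw new IllegalArgumentException("key does not fit into 64 bits");
    }
    long key = 0L;
    // One pass per bit position, one step per dimension: the cost grows
    // with dimensionality * bits, and is paid for every point and query.
    for (int b = bits - 1; b >= 0; b--) {
      for (int d = 0; d < coords.length; d++) {
        key = (key << 1) | ((coords[d] >>> b) & 1L);
      }
    }
    return key;
  }

  public static void main(String[] args) {
    // x = 3 (binary 11), y = 0 (binary 00) interleave to binary 1010.
    System.out.println(Long.toBinaryString(interleave(new int[] { 3, 0 }, 2)));
  }
}
```

The nested loop is the part I would benchmark first: a naive version like this touches every bit of every dimension, and a Hilbert curve adds further per-level transformations on top of it.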

You are welcome to ask for help on the mailing list or here.

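I would first do an in-memory variant, to fully understand the index. Then benchmark it to identify optimization potential, and debug it. Before that, you probably will not have found all the corner cases yet, such as duplicate point handling, degenerate data sets, etc.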
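One way to catch such corner cases early is to compare the new index against a plain linear scan on small, deliberately nasty data. The following is only a sketch under that assumption: `LinearScanBaseline` and `rangeQuery` are hypothetical names, it does not use any ELKI API, and it hardcodes Euclidean distance.

```java
import java.util.ArrayList;
import java.util.List;

public final class LinearScanBaseline {
  /** Indices of all points within `radius` (Euclidean) of `query`. */
  static List<Integer> rangeQuery(double[][] data, double[] query, double radius) {
    List<Integer> result = new ArrayList<>();
    double r2 = radius * radius; // compare squared distances, no sqrt needed
    for (int i = 0; i < data.length; i++) {
      double sum = 0;
      for (int d = 0; d < query.length; d++) {
        double diff = data[i][d] - query[d];
        sum += diff * diff;
      }
      if (sum <= r2) {
        result.add(i);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Degenerate data: three exact duplicates, and all points on one line.
    double[][] data = { { 1, 1 }, { 1, 1 }, { 1, 1 }, { 5, 1 } };
    // A radius-0 query must return every duplicate, not just one of them.
    System.out.println(rangeQuery(data, new double[] { 1, 1 }, 0.0));
  }
}
```

Whatever the index returns for the same data, query, and radius should match this result as a set; a mismatch on duplicates or on a boundary point (here, the point at distance exactly 4) is usually the first bug to show up.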

Always make on-disk optional. If your data fits into memory, a memory-only implementation will be much faster than any on-disk version.

When contributing to ELKI, please:

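Avoid external dependencies. We have had bad experiences with the quality of e.g. Apache Commons, and we want the package to be easy to install and maintain, so we want to keep the number of .jar dependencies to a minimum (also, lots of jars with redundant functionality come at a performance cost). I would tend to accept external dependencies only for optional extension modules.
Do not copy code from other sources. ELKI is AGPL-3 licensed, and any contribution to ELKI itself should be AGPL-3 licensed, too. In some cases it may be possible to include e.g. public domain code, but we need to keep this to a minimum. We may use Apache-licensed code (in an external library), but should not mix the two. So from a quick look, copying that source code into ELKI would not be allowed.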

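If you are looking for data mining project ideas, here is a list of articles/algorithms that we would like to see contributed to ELKI (we keep this list up to date for student implementation projects):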

http://elki.dbs.ifi.lmu.de/wiki/ProjectIdeas
