Question

Imagine I have a tree like this:

- One
  - One one
  - One two
    - One two one
    - One two two
    - One two three
      - One two three one
  - One three
    - One three one
    - One three two
    - One three three
  - One four
  - One five

Data-wise, it is simple too: just a child-to-parent relationship:

+-------------------+---------------+
|       Child       |    Parent     |
+-------------------+---------------+
| One               |               |
| One one           | One           |
| One two           | One           |
| One two one       | One two       |
| One two two       | One two       |
| One two three     | One two       |
| One two three one | One two three |
| One three         | One           |
| One three one     | One three     |
| One three two     | One three     |
| One three three   | One three     |
| One four          | One           |
| One five          | One           |
+-------------------+---------------+

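What I want to do now is this:

I have a list of two items, say One three three and One two three one.
I want to build the rest of the tree from those items, following their parents up to the root level.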

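In an RDBMS I would simply write a recursive query using a CTE and UNION ALL, but I cannot work out whether this is possible in Spark with Datasets or DataFrames, probably because of my limited Scala / Python knowledge. Any help would be greatly appreciated.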

The output should look like this:

- One
  - One two
    - One two three
      - One two three one
  - One three
    - One three three

Answer 1

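You can use a GraphX-based solution to run recursive queries (parent/child or hierarchical queries). Many databases offer this as a built-in feature, known as recursive common table expressions (CTEs) or the CONNECT BY SQL clause.

For details, see this article: https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/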
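Independently of GraphX, the logic a recursive CTE performs here can be sketched as a simple loop: start from the selected items, look up their parents, and repeat until no new nodes appear. The sketch below is plain Python rather than Spark code, and the names `EDGES`, `ancestors_closure` and `render` are made up for illustration. In Spark, each loop iteration would roughly correspond to joining the current frontier against the child/parent DataFrame, which is an assumption about how you would port it, not something taken from the linked article.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Child -> parent rows from the question, in the same order as the table.
# The root row has no parent, represented here as None.
EDGES: List[Tuple[str, Optional[str]]] = [
    ("One", None),
    ("One one", "One"),
    ("One two", "One"),
    ("One two one", "One two"),
    ("One two two", "One two"),
    ("One two three", "One two"),
    ("One two three one", "One two three"),
    ("One three", "One"),
    ("One three one", "One three"),
    ("One three two", "One three"),
    ("One three three", "One three"),
    ("One four", "One"),
    ("One five", "One"),
]


def ancestors_closure(
    edges: List[Tuple[str, Optional[str]]], targets: Iterable[str]
) -> Set[str]:
    """Return the targets plus every ancestor up to the root (hypothetical helper)."""
    parent_of = dict(edges)
    keep: Set[str] = set()
    # Ignore targets that do not exist in the edge table.
    frontier = {t for t in targets if t in parent_of}
    while frontier:
        keep |= frontier
        # One "recursive step": fetch the parents of the current frontier.
        # Subtracting `keep` stops the loop and guards against cycles.
        parents = {parent_of[n] for n in frontier if parent_of[n] is not None}
        frontier = parents - keep
    return keep


def render(edges: List[Tuple[str, Optional[str]]], keep: Set[str]) -> str:
    """Print kept nodes as an indented list, relying on the table's preorder."""
    parent_of = dict(edges)

    def depth(node: str) -> int:
        d = 0
        while parent_of[node] is not None:
            node = parent_of[node]
            d += 1
        return d

    return "\n".join("  " * depth(c) + "- " + c for c, _ in edges if c in keep)


if __name__ == "__main__":
    kept = ancestors_closure(EDGES, ["One three three", "One two three one"])
    print(render(EDGES, kept))
```

Run against the table from the question, this keeps exactly the six nodes in the desired output. The number of loop iterations equals the depth of the deepest selected item, so for shallow hierarchies a handful of joins would be enough, while very deep trees are where a Pregel-style approach may pay off.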
