Building a hierarchy with Spark

Date: 2017-06-01 11:18:43

Tags: apache-spark

Imagine I have a tree like this:

- One
  - One one
  - One two
    - One two one
    - One two two
    - One two three
      - One two three one
  - One three
    - One three one
    - One three two
    - One three three
  - One four
  - One five

Data-wise it is simple too, just a child-to-parent relation:

+-------------------+---------------+
|       Child       |    Parent     |
+-------------------+---------------+
| One               |               |
| One one           | One           |
| One two           | One           |
| One two one       | One two       |
| One two two       | One two       |
| One two three     | One two       |
| One two three one | One two three |
| One three         | One           |
| One three one     | One three     |
| One three two     | One three     |
| One three three   | One three     |
| One four          | One           |
| One five          | One           |
+-------------------+---------------+

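What I want to do now is:

  • I have a list of two items, say "One three three" and "One two three one"
  • I want to build the rest of the tree from those items through their parents, up to the root level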

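In an RDBMS I would simply write a recursive query using a CTE with UNION ALL, but I cannot work out whether this is possible in Spark with Datasets or DataFrames, probably because of my limited Scala / Python knowledge. Any help would be appreciated.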

The output should look like this:

- One
  - One two
    - One two three
      - One two three one
  - One three
    - One three three

1 answer:

Answer 0: (score: 1)

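You can use a GraphX-based solution to run recursive queries (parent/child or hierarchical queries). Many databases offer this capability as recursive common table expressions (CTEs) or as the CONNECT BY SQL clause.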

For details, see this article: https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/
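
If it helps to see the logic before reaching for GraphX, here is a minimal sketch of the upward walk that a recursive CTE would perform, written in plain Python rather than Spark so it runs anywhere. It is not taken from the linked article. The names `PARENT`, `ancestors_closure` and `render` are made up for illustration, and the data is the table from the question. The assumption is that in Spark each pass of the loop would become a join between the current frontier and the child/parent DataFrame, repeated until no new rows appear.

```python
# Child -> parent edges copied from the question's table (None marks the root).
# Hypothetical names throughout; this is an illustration, not a Spark API.
PARENT = {
    "One": None,
    "One one": "One",
    "One two": "One",
    "One two one": "One two",
    "One two two": "One two",
    "One two three": "One two",
    "One two three one": "One two three",
    "One three": "One",
    "One three one": "One three",
    "One three two": "One three",
    "One three three": "One three",
    "One four": "One",
    "One five": "One",
}


def ancestors_closure(parent, targets):
    """Return the targets plus all of their ancestors up to the root."""
    keep = set()
    frontier = set(targets)
    while frontier:
        keep |= frontier
        # One "recursive step": look up the parent of every frontier node.
        # In Spark this lookup would be a join against the edge table.
        frontier = {parent[n] for n in frontier if parent[n] is not None} - keep
    return keep


def render(parent, keep):
    """Render the kept nodes as an indented list, in the table's row order."""
    children = {}
    for child in parent:  # dicts preserve insertion order
        if child in keep:
            children.setdefault(parent[child], []).append(child)

    lines = []

    def walk(node, depth):
        lines.append("  " * depth + "- " + node)
        for c in children.get(node, []):
            walk(c, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


kept = ancestors_closure(PARENT, ["One three three", "One two three one"])
tree_lines = render(PARENT, kept)
print("\n".join(tree_lines))
```

For the two items in the question this prints the pruned tree shown under "The output should look like this". The loop runs once per level of the hierarchy, so a Spark port would need roughly as many joins as the tree is deep. The Pregel approach in the article is one way to avoid hand-writing that loop.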