Imagine I have a tree like this:
- One
  - One one
  - One two
    - One two one
    - One two two
    - One two three
      - One two three one
  - One three
    - One three one
    - One three two
    - One three three
  - One four
  - One five
Data-wise it is simple too, just a child-to-parent relationship:
+-------------------+---------------+
| Child             | Parent        |
+-------------------+---------------+
| One               |               |
| One one           | One           |
| One two           | One           |
| One two one       | One two       |
| One two two       | One two       |
| One two three     | One two       |
| One two three one | One two three |
| One three         | One           |
| One three one     | One three     |
| One three two     | One three     |
| One three three   | One three     |
| One four          | One           |
| One five          | One           |
+-------------------+---------------+
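Now what I would like to do is look up two nodes, One three three and One two three one, and get each of them back together with all of its ancestors.
In an RDBMS I would simply write a recursive query using a CTE and UNION ALL, but I cannot work out whether this is possible in Spark with Datasets or DataFrames, probably because of my lack of Scala / Python knowledge. Any help would be appreciated.
The output should look like this: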
- One
  - One two
    - One two three
      - One two three one
  - One three
    - One three three
Answer 0 (score: 1)
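You can use a GraphX-based solution to perform recursive queries (parent/child or hierarchical queries). Many databases provide this functionality as recursive Common Table Expressions (CTEs) or the CONNECT BY SQL clause.
For details, see this article: https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/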
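For a tree this small, the logic behind the recursive query can also be sketched as a loop that climbs one level per pass. The block below is a minimal illustration in plain Python, not Spark code and not the method from the linked article. The names `EDGES`, `ancestors_closure` and `render` are made up for this example. The assumption is that in Spark each pass of the loop would become a join of the current frontier against the Child/Parent DataFrame, repeated until a pass adds no new rows.

```python
# Child -> Parent rows from the question; the root has no parent (None).
EDGES = [
    ("One", None),
    ("One one", "One"),
    ("One two", "One"),
    ("One two one", "One two"),
    ("One two two", "One two"),
    ("One two three", "One two"),
    ("One two three one", "One two three"),
    ("One three", "One"),
    ("One three one", "One three"),
    ("One three two", "One three"),
    ("One three three", "One three"),
    ("One four", "One"),
    ("One five", "One"),
]


def ancestors_closure(edges, targets):
    """Return the target nodes plus every ancestor of each target."""
    parent_of = {child: parent for child, parent in edges}
    found = set()
    frontier = set(targets)
    while frontier:
        found |= frontier
        # One pass: look up the parent of every node in the frontier.
        parents = {parent_of.get(node) for node in frontier}
        # Drop the root's missing parent and nodes that were already found.
        frontier = {p for p in parents if p is not None} - found
    return found


def render(edges, keep):
    """Format the kept nodes as an indented list, in the original row order."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    lines = []

    def walk(node, depth):
        # keep contains all ancestors of its members, so skipping a node
        # that is not kept never hides a kept descendant.
        if node not in keep:
            return
        lines.append("  " * depth + "- " + node)
        for child in children.get(node, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


keep = ancestors_closure(EDGES, ["One three three", "One two three one"])
for line in render(EDGES, keep):
    print(line)
```

The loop stops as soon as a pass finds no new parents, so the number of passes equals the depth of the deepest selected node. That is the same fixed point a recursive CTE with UNION ALL reaches in an RDBMS.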