Spark Scala - Translate an array into values from the same table in a hierarchy-type table

Date: 2018-03-22 20:05:45

Tags: scala apache-spark apache-spark-sql spark-dataframe

I have a data table with a hierarchical, tree-structured data model.

I want to convert its rows into a flattened version; the sample output would be:

-----------------------------------------------
Id  | name     | parentId | path       | depth
-----------------------------------------------
55  | Canada   | null     | null       | 0
77  | Ontario  | 55       | /55        | 1
100 | Toronto  | 77       | /55/77     | 2
104 | Brampton | 100      | /55/77/100 | 3

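In short, how do I generate PathFullNames, which come from the same table by matching the ids in the path? In the example above, /55/77/100 corresponds to /Canada/Ontario/Toronto.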

Hope this makes sense.

1 answer:

Answer 0 (score: 1)

Maybe this will help with your specific problem.

You can create a map (dictionary) from the columns Id and name:

// Generate a dict: Id -> name
val idMap = test.distinct.select($"Id", $"name").rdd.map(r => (r.getInt(0), r.getString(1))).collectAsMap

Then define a UDF (user-defined function) that maps the string

/55/77

to the string

Canada,Ontario

val pathMap = udf((p: String) => p.split("/").filter(_!="").map(id => idMap(id.toInt)).mkString(","))
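The string handling inside the UDF above can be tried without Spark. The following is only a sketch: `PathNames`, `pathToNames` and the `sep` parameter are hypothetical names introduced here, and the small in-memory `idMap` stands in for the one collected from the DataFrame. Passing "/" as the separator gives the /Canada/Ontario/Toronto form asked for in the question.

```scala
object PathNames {
  // Hypothetical helper (not part of the original answer): splits a path such as
  // "/55/77" on "/", drops the empty leading segment, looks each id up in idMap
  // and joins the resulting names with `sep`.
  // Note: idMap(id.toInt) throws NoSuchElementException for an id that is not in the map.
  def pathToNames(path: String, idMap: Map[Int, String], sep: String = ","): String =
    path.split("/").filter(_.nonEmpty).map(id => idMap(id.toInt)).mkString(sep)

  def main(args: Array[String]): Unit = {
    // Stand-in for the map built from the Id and name columns.
    val idMap = Map(55 -> "Canada", 77 -> "Ontario", 100 -> "Toronto", 104 -> "Brampton")

    println(pathToNames("/55/77", idMap))                 // Canada,Ontario
    println("/" + pathToNames("/55/77/100", idMap, "/"))  // /Canada/Ontario/Toronto
  }
}
```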

Finally, add a new column by applying this UDF to the path column:

test.select(col("*"), when($"path".isNull, "None").otherwise(pathMap($"path")).as("pathNames")).show(false)

This gives you the DataFrame you want:

+---+--------+--------+----------+-----+----------------------+
|Id |name    |parentId|path      |depth|pathNames             |
+---+--------+--------+----------+-----+----------------------+
|55 |Canada  |null    |null      |0    |None                  |
|77 |Ontario |55      |/55       |1    |Canada                |
|100|Toronto |77      |/55/77    |2    |Canada,Ontario        |
|104|Brampton|100     |/55/77/100|3    |Canada,Ontario,Toronto|
+---+--------+--------+----------+-----+----------------------+
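The snippets in this answer assume that a DataFrame named test and the usual Spark imports already exist. For anyone who wants to run the steps end to end, here is a minimal sketch under stated assumptions: a local SparkSession, Id and parentId stored as integers, and sample rows copied from the tables in this thread. The names `PathNamesExample`, `sampleRows` and `withPathNames` are hypothetical and only used for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, when}

object PathNamesExample {
  // Assumption: rebuilds the sample rows shown in this thread as a DataFrame.
  def sampleRows(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, String, Option[Int], Option[String], Int)](
      (55, "Canada", None, None, 0),
      (77, "Ontario", Some(55), Some("/55"), 1),
      (100, "Toronto", Some(77), Some("/55/77"), 2),
      (104, "Brampton", Some(100), Some("/55/77/100"), 3)
    ).toDF("Id", "name", "parentId", "path", "depth")
  }

  // Same three steps as in the answer: collect Id -> name, wrap the lookup
  // in a UDF, and add the pathNames column.
  def withPathNames(df: DataFrame): DataFrame = {
    // Collected to the driver, so this only suits tables that fit in memory.
    val idMap = df.select(col("Id"), col("name")).distinct.rdd
      .map(r => (r.getInt(0), r.getString(1)))
      .collectAsMap()

    val pathMap = udf((p: String) =>
      p.split("/").filter(_.nonEmpty).map(id => idMap(id.toInt)).mkString(","))

    df.select(
      col("*"),
      when(col("path").isNull, "None").otherwise(pathMap(col("path"))).as("pathNames"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("path-names").getOrCreate()
    withPathNames(sampleRows(spark)).show(false)
    spark.stop()
  }
}
```

Because the whole Id -> name map is collected to the driver and captured by the UDF, this approach is best suited to lookup tables of modest size.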

Hope this helps!

P.S. Sorry for my English.