Scala Spark - processing a hierarchical data table

Date: 2018-03-19 19:54:46

Tags: scala apache-spark apache-spark-sql spark-dataframe spark-streaming

I have a data table with a hierarchical, tree-structured data model. For example, here are some sample rows:

-------------------------------------------
Id | name    |parentId | path       | depth
-------------------------------------------
55 | Canada  | null    | null       | 0
77 | Ontario |  55     | /55        | 1
100| Toronto |  77     | /55/77     | 2
104| Brampton| 100     | /55/77/100 | 3

I want to transform these rows into a flattened version. The expected output would be:

-----------------------------------
Id | name     | parentId | depth
------------------------------------
104| Brampton | Toronto  | 3
100| Toronto  | Ontario  | 2
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
100| Toronto  | Ontario  | 2
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
55 | Canada   | None     | 0

I tried a Cartesian join and an n²-style search, but neither of them worked.
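The shape of the desired output can be sketched without Spark at all: each row's path column already lists its ancestor ids, so the flattened result is just the row itself plus one copy of every ancestor row. A minimal plain-Scala sketch of that expansion (the Node case class and names are illustrative, not from the question):

```scala
// Plain-Scala sketch: emit each node plus one copy of every ancestor
// listed in its `path` column.
case class Node(id: Int, name: String, parentId: Option[Int], path: Option[String], depth: Int)

val nodes = Seq(
  Node(55, "Canada", None, None, 0),
  Node(77, "Ontario", Some(55), Some("/55"), 1),
  Node(100, "Toronto", Some(77), Some("/55/77"), 2),
  Node(104, "Brampton", Some(100), Some("/55/77/100"), 3)
)
val byId = nodes.map(n => n.id -> n).toMap

val flattened = nodes.flatMap { n =>
  // ancestor ids parsed from the path, e.g. "/55/77" -> Seq(55, 77)
  val ancestorIds = n.path.toSeq.flatMap(_.split("/").filter(_.nonEmpty)).map(_.toInt)
  (n.id +: ancestorIds).map(byId)
}.map(n => (n.id, n.name, n.parentId.flatMap(byId.get).map(_.name).getOrElse("None"), n.depth))

// flattened has 10 rows: 4 from Brampton's chain, 3 from Toronto's, 2 from Ontario's, 1 from Canada
```

This matches the 10-row expected output above, which is exactly what the Spark answers below reproduce with joins and explode.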

3 Answers:

Answer 0 (score: 0)

Here is one way:

// Create a DF with the sample data
import org.apache.spark.sql.functions._  // regexp_replace, split, explode
import spark.implicits._                 // toDF and the $"col" syntax

def getSeq(s: String): Seq[String] = s.split('|').map(_.trim).toSeq
var l = getSeq("77 | Ontario |  55     | /55        | 1") :: Nil
l :+= getSeq("55 | Canada  | null    | null       | 0")
l :+= getSeq("100| Toronto |  77     | /55/77     | 2")
l :+= getSeq("104| Brampton| 100     | /55/77/100 | 3")
val df = l.map { case Seq(a, b, c, d, e) => (a, b, c, d, e) }.toDF("Id", "name", "parentId", "path", "depth")

// Original DF with parentName, using a self join
val dfWithPar = df.as("df1")
  .join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter")
  .select($"df1.Id", $"df1.name", $"df1.parentId", $"df1.path", $"df1.depth", $"df2.name".as("parentName"))

// Split the path as required and explode it into one row per ancestor id
val dfExploded = dfWithPar
  .withColumn("path", regexp_replace($"path", "^/", ""))
  .withColumn("path", split($"path", "/"))
  .withColumn("path", explode($"path"))

// Join the original with the exploded DF to get the addendum of rows, one per path entry
val dfJoined = dfWithPar
  .join(dfExploded, dfWithPar.col("Id") === dfExploded.col("path"))
  .select(dfWithPar.col("Id"), dfWithPar.col("name"), dfWithPar.col("parentId"),
          dfWithPar.col("path"), dfWithPar.col("depth"), dfWithPar.col("parentName"))

// Get the final result by unioning the addendum with the original
dfWithPar.union(dfJoined).select($"Id", $"name", $"parentName", $"depth").show

+---+--------+----------+-----+
| Id|    name|parentName|depth|
+---+--------+----------+-----+
| 77| Ontario|    Canada|    1|
| 55|  Canada|      null|    0|
|100| Toronto|   Ontario|    2|
|104|Brampton|   Toronto|    3|
| 77| Ontario|    Canada|    1|
| 77| Ontario|    Canada|    1|
| 55|  Canada|      null|    0|
| 55|  Canada|      null|    0|
| 55|  Canada|      null|    0|
|100| Toronto|   Ontario|    2|
+---+--------+----------+-----+
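The regexp_replace/split step above can also be checked in isolation. In plain Scala (no Spark), the same transformation applied to a single path value looks like:

```scala
// Equivalent of regexp_replace($"path", "^/", "") followed by split($"path", "/"),
// applied to one path value in plain Scala.
val path = "/55/77/100"
val parts = path.replaceFirst("^/", "").split("/")
// parts is Array("55", "77", "100"): one ancestor id per element, ready to be exploded
```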

Answer 1 (score: 0)

A self join, selecting the appropriate columns, should work for you.

The solution is a bit tricky, because you need to look up the parent name for every id in the path column, which calls for the concat_ws, split, and explode built-in functions applied to the parentId and path columns. The rest of the flow is joins, selects, and fills.

Given the dataframe:

+---+--------+--------+----------+-----+
|Id |name    |parentId|path      |depth|
+---+--------+--------+----------+-----+
|55 |Canada  |null    |null      |0    |
|77 |Ontario |55      |/55       |1    |
|100|Toronto |77      |/55/77    |2    |
|104|Brampton|100     |/55/77/100|3    |
+---+--------+--------+----------+-----+

you can generate the temporary dataframe, to be joined in the final step, by doing:
val df2 = df.as("table1")
  .join(df.as("table2"), col("table1.parentId") === col("table2.Id"), "left")
  .select(col("table1.Id").as("path"), col("table1.name").as("name"), col("table2.name").as("parentId"), col("table1.depth").as("depth"))
  .na.fill("None")
//    +----+--------+--------+-----+
//    |path|name    |parentId|depth|
//    +----+--------+--------+-----+
//    |55  |Canada  |None    |0    |
//    |77  |Ontario |Canada  |1    |
//    |100 |Toronto |Ontario |2    |
//    |104 |Brampton|Toronto |3    |
//    +----+--------+--------+-----+

The required dataframe can then be achieved by doing:

df.withColumn("path", explode(split(concat_ws("", col("parentId"), col("path")), "/")))
  .as("table1")
  .join(df2.as("table2"), Seq("path"), "right")
  .select(col("table2.path").as("Id"), col("table2.name").as("name"), col("table2.parentId").as("parentId"), col("table2.depth").as("depth"))
  .na.fill("0")
  .show(false)
//    +---+--------+--------+-----+
//    |Id |name    |parentId|depth|
//    +---+--------+--------+-----+
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |77 |Ontario |Canada  |1    |
//    |77 |Ontario |Canada  |1    |
//    |77 |Ontario |Canada  |1    |
//    |100|Toronto |Ontario |2    |
//    |100|Toronto |Ontario |2    |
//    |104|Brampton|Toronto |3    |
//    +---+--------+--------+-----+

Explanation
Taking the last row as an example: concat_ws("", col("parentId"), col("path")) turns |104|Brampton|100 |/55/77/100|3 | into |104|Brampton|100 |100/55/77/100|3 |, i.e. the parentId 100 is concatenated at the front. split(concat_ws("", col("parentId"), col("path")), "/") then generates an array column, |104|Brampton|100 |[100, 55, 77, 100]|3 |, and explode(split(concat_ws("", col("parentId"), col("path")), "/")) as a whole explodes that array column into separate rows:

|104|Brampton|100 |100 |3 |
|104|Brampton|100 |55  |3 |
|104|Brampton|100 |77  |3 |
|104|Brampton|100 |100 |3 |

The rest should be clear enough not to need explanation ;)

I hope the answer is helpful.
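The concat_ws/split mechanics above can be reproduced on Brampton's row in plain Scala. Note that the parent id 100 ends up in the array twice, once prepended from parentId and once from the tail of path:

```scala
// Brampton's row: parentId = "100", path = "/55/77/100".
// With an empty separator, concat_ws("", parentId, path) is plain string concatenation.
val parentId = "100"
val path = "/55/77/100"
val concatenated = parentId + path   // "100/55/77/100"
val arr = concatenated.split("/")    // Array("100", "55", "77", "100")
```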

Answer 2 (score: 0)

Here is another version:

import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkConf = new SparkConf().setAppName("pathtest").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()

import spark.implicits._

val dfA = spark.createDataset(Seq(
  (55, "Canada", -1, "", 0),
  (77, "Ontario", 55, "/55", 1),
  (100, "Toronto", 77, "/55/77", 2),
  (104, "Brampton", 100, "/55/77/100", 3))
)
.toDF("Id", "name", "parentId", "path", "depth")


val getArray = udf((path: String) => {
  if (path.contains("/"))
    path.split("/")
  else
    // one-element array holding null: exploding an empty array would drop the row
    Array[String](null)
})

val dfB = dfA
    .withColumn("path", getArray(col("path")))
    .withColumn("path", explode(col("path")))
    .toDF()

dfB.as("B").join(dfA.as("A"), $"B.parentId" === $"A.Id", "left")
  .select($"B.Id".as("Id"), $"B.name".as("name"), $"A.name".as("parent"), $"B.depth".as("depth"))
    .show()

There are two dataframes, dfA and dfB, the second produced from the first. dfB is generated with a udf by exploding the path array. Note the trick for Canada: the udf returns an array containing a single null element, because exploding an empty array would not generate any row at all.

dfB looks like this:

+---+--------+--------+----+-----+
| Id|    name|parentId|path|depth|
+---+--------+--------+----+-----+
| 55|  Canada|      -1|null|    0|
| 77| Ontario|      55|    |    1|
| 77| Ontario|      55|  55|    1|
|100| Toronto|      77|    |    2|
|100| Toronto|      77|  55|    2|
|100| Toronto|      77|  77|    2|
|104|Brampton|     100|    |    3|
|104|Brampton|     100|  55|    3|
|104|Brampton|     100|  77|    3|
|104|Brampton|     100| 100|    3|
+---+--------+--------+----+-----+ 

The final result looks like this:

+---+--------+-------+-----+
| Id|    name| parent|depth|
+---+--------+-------+-----+
| 55|  Canada|   null|    0|
| 77| Ontario| Canada|    1|
| 77| Ontario| Canada|    1|
|100| Toronto|Ontario|    2|
|100| Toronto|Ontario|    2|
|100| Toronto|Ontario|    2|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
+---+--------+-------+-----+
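The udf's split behavior can be verified without Spark. Splitting a path that starts with "/" yields a leading empty string, which is what produces the empty path entries visible in dfB above, while the single-null array keeps Canada alive through the explode:

```scala
// Plain-Scala check of the getArray udf's behavior.
def getArray(path: String): Array[String] =
  if (path.contains("/")) path.split("/")
  else Array[String](null)

val ontario = getArray("/55")  // Array("", "55"): note the leading empty string
val canada  = getArray("")     // Array(null): one element, so explode still emits a row
```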