I have a data table that holds a hierarchical, tree-structured data model. For example, here are some sample rows:
-------------------------------------------
Id | name |parentId | path | depth
-------------------------------------------
55 | Canada | null | null | 0
77 | Ontario | 55 | /55 | 1
100| Toronto | 77 | /55/77 | 2
104| Brampton| 100 | /55/77/100 | 3
I want to convert these rows into a flattened version. The sample output would be:
-----------------------------------
Id | name | parentId | depth
------------------------------------
104| Brampton | Toronto | 3
100| Toronto | Ontario | 2
77 | Ontario | Canada | 1
55 | Canada | None | 0
100| Toronto | Ontario | 2
77 | Ontario | Canada | 1
55 | Canada | None | 0
77 | Ontario | Canada | 1
55 | Canada | None | 0
55 | Canada | None | 0
I tried a Cartesian join and something like an n^2 search, but neither of them worked.
Answer 0 (score: 0)
Here is one way:
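//Outside spark-shell this needs: import spark.implicits._ and import org.apache.spark.sql.functions._
//Creating DF with your data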
def getSeq(s:String): Seq[String] = { s.split('|').map(_.trim).toSeq }
var l = getSeq("77 | Ontario | 55 | /55 | 1") :: Nil
l :+= getSeq("55 | Canada | null | null | 0")
l :+= getSeq("100| Toronto | 77 | /55/77 | 2")
l :+= getSeq("104| Brampton| 100 | /55/77/100 | 3")
val df = l.map(x => x match { case Seq(a,b,c,d,e) => (a,b,c,d,e) }).toDF("Id", "name", "parentId", "path", "depth")
//original DF with parentName using a self join
val dfWithPar = df.as("df1").join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter").select($"df1.Id",$"df1.name",$"df1.parentId",$"df1.path",$"df1.depth",$"df2.name".as("parentName"))
// Split path as per requirement and get the exploded DF
val dfExploded = dfWithPar.withColumn("path", regexp_replace($"path", "^/", "")).withColumn("path", split($"path","/")).withColumn("path", explode($"path"))
//Join orig with exploded to get addendum of rows as per individual path placeholders
val dfJoined = dfWithPar.join(dfExploded, dfWithPar.col("Id") === dfExploded.col("path")).select(dfWithPar.col("Id"), dfWithPar.col("name"), dfWithPar.col("parentId"), dfWithPar.col("path"), dfWithPar.col("depth"), dfWithPar.col("parentName"))
//Get the final result by adding the addendum to orig
dfWithPar.union(dfJoined).select($"Id", $"name", $"parentName", $"depth").show
+---+--------+----------+-----+
| Id| name|parentName|depth|
+---+--------+----------+-----+
| 77| Ontario| Canada| 1|
| 55| Canada| null| 0|
|100| Toronto| Ontario| 2|
|104|Brampton| Toronto| 3|
| 77| Ontario| Canada| 1|
| 77| Ontario| Canada| 1|
| 55| Canada| null| 0|
| 55| Canada| null| 0|
| 55| Canada| null| 0|
|100| Toronto| Ontario| 2|
+---+--------+----------+-----+
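As a cross-check on the row counts above (Canada four times, Ontario three times, Toronto twice, Brampton once), the same expansion can be sketched with plain Scala collections and no Spark. This is only an illustration under assumptions: Node, FlattenSketch, ancestors and flatten are hypothetical names that do not appear in the answers, and the null columns are modelled as Option.

```scala
// Plain-Scala sketch of the flattening; no Spark involved.
// Node, FlattenSketch, ancestors and flatten are hypothetical names.
final case class Node(id: Int, name: String, parentId: Option[Int], path: Option[String], depth: Int)

object FlattenSketch {
  val nodes: Seq[Node] = Seq(
    Node(55, "Canada", None, None, 0),
    Node(77, "Ontario", Some(55), Some("/55"), 1),
    Node(100, "Toronto", Some(77), Some("/55/77"), 2),
    Node(104, "Brampton", Some(100), Some("/55/77/100"), 3)
  )

  // Ancestor ids listed in a path such as "/55/77", root first.
  def ancestors(n: Node): Seq[Int] =
    n.path.toSeq.flatMap(_.split("/").filter(_.nonEmpty).map(_.toInt).toSeq)

  // For every node, deepest first, emit the node itself and then its ancestors, nearest first.
  // Each emitted tuple is (Id, name, parent name, depth).
  def flatten(rows: Seq[Node]): Seq[(Int, String, Option[String], Int)] = {
    val byId = rows.map(n => n.id -> n).toMap
    rows.sortBy(-_.depth).flatMap { n =>
      (n +: ancestors(n).reverse.map(byId)).map { m =>
        (m.id, m.name, m.parentId.flatMap(byId.get).map(_.name), m.depth)
      }
    }
  }
}
```

Calling FlattenSketch.flatten(FlattenSketch.nodes) yields ten tuples, starting with Brampton followed by its ancestors, which is the ordering shown in the question's sample output. On a real cluster the DataFrame joins are still the way to go; this sketch only makes the intended result explicit.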
Answer 1 (score: 0)
A self join with the right condition, plus selecting the appropriate columns, should work for you.
The solution is a bit tricky because you need to look up every parent name listed in the path column as well as the one in the parentId column, which requires the concat_ws, split and explode built-in functions. The rest of the process is joins, selects and fills.
Given the dataframe:
+---+--------+--------+----------+-----+
|Id |name    |parentId|path      |depth|
+---+--------+--------+----------+-----+
|55 |Canada  |null    |null      |0    |
|77 |Ontario |55      |/55       |1    |
|100|Toronto |77      |/55/77    |2    |
|104|Brampton|100     |/55/77/100|3    |
+---+--------+--------+----------+-----+
You can generate a temporary dataframe for the final join by doing:
import org.apache.spark.sql.functions._
// Self join to replace each parentId with the parent's name; Id is renamed to path so it can serve as the join key later
val df2 = df.as("table1")
  .join(df.as("table2"), col("table1.parentId") === col("table2.Id"), "left")
  .select(col("table1.Id").as("path"), col("table1.name").as("name"), col("table2.name").as("parentId"), col("table1.depth").as("depth"))
  .na.fill("None")
// +----+--------+--------+-----+
// |path|name    |parentId|depth|
// +----+--------+--------+-----+
// |55  |Canada  |None    |0    |
// |77  |Ontario |Canada  |1    |
// |100 |Toronto |Ontario |2    |
// |104 |Brampton|Toronto |3    |
// +----+--------+--------+-----+
The required dataframe can then be obtained by doing:
// Explode parentId + path into one row per id, then right join against df2 on that id
df.withColumn("path", explode(split(concat_ws("", col("parentId"), col("path")), "/")))
  .as("table1")
  .join(df2.as("table2"), Seq("path"), "right")
  .select(col("table2.path").as("Id"), col("table2.name").as("name"), col("table2.parentId").as("parentId"), col("table2.depth").as("depth"))
  .na.fill("0")
  .show(false)
// +---+--------+--------+-----+
// |Id |name    |parentId|depth|
// +---+--------+--------+-----+
// |55 |Canada  |None    |0    |
// |55 |Canada  |None    |0    |
// |55 |Canada  |None    |0    |
// |55 |Canada  |None    |0    |
// |77 |Ontario |Canada  |1    |
// |77 |Ontario |Canada  |1    |
// |77 |Ontario |Canada  |1    |
// |100|Toronto |Ontario |2    |
// |100|Toronto |Ontario |2    |
// |104|Brampton|Toronto |3    |
// +---+--------+--------+-----+
Explanation
For the row
|104|Brampton|100 |/55/77/100|3 |
concat_ws("", col("parentId"), col("path")) generates
|104|Brampton|100 |100/55/77/100|3 |
As you can see, 100 is concatenated at the front.
split(concat_ws("", col("parentId"), col("path")), "/") generates the array column
|104|Brampton|100 |[100, 55, 77, 100]|3 |
and explode(split(concat_ws("", col("parentId"), col("path")), "/")) as a whole explodes the array column into separate rows:
|104|Brampton|100 |100 |3 |
|104|Brampton|100 |55 |3 |
|104|Brampton|100 |77 |3 |
|104|Brampton|100 |100 |3 |
The joins are easier to follow and need no explanation ;)
I hope the answer is helpful.
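The string handling in the explanation above can be tried without Spark. The following is a minimal sketch, assuming that concat_ws skips null inputs (modelled here as None) and that splitting on "/" behaves like the JVM's String.split for these values; PathTokens and tokens are hypothetical names used only for illustration.

```scala
// Plain-Scala imitation of concat_ws("", parentId, path) followed by split(..., "/").
// PathTokens and tokens are hypothetical names; None stands in for a null column value.
object PathTokens {
  def tokens(parentId: Option[String], path: Option[String]): Seq[String] = {
    // Concatenate the non-null pieces with an empty separator.
    val joined = (parentId.toSeq ++ path.toSeq).mkString("")
    // Split on "/" to get one token per id.
    joined.split("/").toSeq
  }
}
```

For the Brampton row, PathTokens.tokens(Some("100"), Some("/55/77/100")) gives the four tokens 100, 55, 77, 100, so 100 appears twice after the explode. For the Canada row both inputs are missing, and the result is a single empty-string token rather than no token at all.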
Answer 2 (score: 0)
Here is another version:
// Imports must come before their first use
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sparkConf = new SparkConf().setAppName("pathtest").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val dfA = spark.createDataset(Seq(
(55, "Canada", -1, "", 0),
(77, "Ontario", 55, "/55", 1),
(100, "Toronto", 77, "/55/77", 2),
(104, "Brampton", 100, "/55/77/100", 3))
)
.toDF("Id", "name", "parentId", "path", "depth")
def getArray = udf((path: String) => {
if (path.contains("/"))
path.split("/")
else
Array[String](null)
})
val dfB = dfA
.withColumn("path", getArray(col("path")))
.withColumn("path", explode(col("path")))
.toDF()
dfB.as("B").join(dfA.as("A"), $"B.parentId" === $"A.Id", "left")
.select($"B.Id".as("Id"), $"B.name".as("name"), $"A.name".as("parent"), $"B.depth".as("depth"))
.show()
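I have two dataframes, dfA and dfB, where dfB is derived from dfA. dfB is generated with a UDF that splits the path into an array, which is then exploded. Note the trick for Canada: the UDF returns a single-element array containing null, because exploding an empty array would not generate a row for it.
dfB looks like this: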
+---+--------+--------+----+-----+
| Id| name|parentId|path|depth|
+---+--------+--------+----+-----+
| 55| Canada| -1|null| 0|
| 77| Ontario| 55| | 1|
| 77| Ontario| 55| 55| 1|
|100| Toronto| 77| | 2|
|100| Toronto| 77| 55| 2|
|100| Toronto| 77| 77| 2|
|104|Brampton| 100| | 3|
|104|Brampton| 100| 55| 3|
|104|Brampton| 100| 77| 3|
|104|Brampton| 100| 100| 3|
+---+--------+--------+----+-----+
The final result is as follows:
+---+--------+-------+-----+
| Id| name| parent|depth|
+---+--------+-------+-----+
| 55| Canada| null| 0|
| 77| Ontario| Canada| 1|
| 77| Ontario| Canada| 1|
|100| Toronto|Ontario| 2|
|100| Toronto|Ontario| 2|
|100| Toronto|Ontario| 2|
|104|Brampton|Toronto| 3|
|104|Brampton|Toronto| 3|
|104|Brampton|Toronto| 3|
|104|Brampton|Toronto| 3|
+---+--------+-------+-----+
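The rows with a blank path in dfB above come from the leading "/" in each path: splitting "/55" on "/" yields an empty first token. A minimal plain-Scala sketch of that behaviour follows, with a variant that drops the blank token. PathSplit, splitPath and splitPathNoBlank are hypothetical names, and the variant is an assumption about what one might prefer, not part of the answer.

```scala
// Plain-Scala view of what the getArray UDF does to a single path value.
// PathSplit, splitPath and splitPathNoBlank are hypothetical names.
object PathSplit {
  // Same branching as the UDF: split when a "/" is present, else a one-element Seq holding null.
  def splitPath(path: String): Seq[String] =
    if (path.contains("/")) path.split("/").toSeq else Seq[String](null)

  // Variant that drops the empty token produced by the leading "/" but keeps the null marker.
  def splitPathNoBlank(path: String): Seq[String] =
    splitPath(path).filter(t => t == null || t.nonEmpty)
}
```

With splitPath, "/55/77" becomes an empty string followed by 55 and 77, which is why Toronto appears three times in dfB. With splitPathNoBlank it would appear twice.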