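I need to filter a list column using another column in the same dataframe.
Below is my DataFrame. I want to filter the col3 list using Col1 and keep only each parent's active children.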
Df.show(10,false):
=============================
Col1  Col2    col3           flag
P1    Parent  [c1,c2,c3,c4]  Active
c1    Child   []             InActive
c2    Child   []             Active
c3    Child   []             Active
Expected Output :
===================
Df.show(10,false):
Col1  Col2    col3     flag
P1    Parent  [c2,c3]  Active
c2    Child   []       Active
c3    Child   []       Active
Can someone help me get the result above?
Answer 0 (score: 0)
I generated your dataframe like this:
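// In spark-shell, spark.implicits._ and org.apache.spark.sql.functions._ are already imported.
val df = Seq(
    ("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
    ("c1", "Child", Seq.empty[String], "Inactive"), // typed empty list so col3 is array<string>
    ("c2", "Child", Seq.empty[String], "Active"),
    ("c3", "Child", Seq.empty[String], "Active"))
  .toDF("Col1", "Col2", "col3", "flag")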
Then I filtered just the active children into their own dataframe, which is one part of the output:
val active_children = df.where('flag === "Active").where('Col2 === "Child")
I also used explode to generate a flattened dataframe of the parent/child relationships:
val rels = df.withColumn("child", explode('col3))
  .select("Col1", "Col2", "flag", "child")
scala> rels.show
+----+------+------+-----+
|Col1| Col2| flag|child|
+----+------+------+-----+
| p1|Parent|Active| c1|
| p1|Parent|Active| c2|
| p1|Parent|Active| c3|
| p1|Parent|Active| c4|
+----+------+------+-----+
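Only the parent row shows up in rels because explode emits no rows for an empty (or null) array, so the children with an empty col3 drop out at this step. A minimal sketch of that behavior, assuming a local SparkSession; the session setup and the name demo are illustrative and not part of the question, and explode_outer (Spark 2.2+) is shown only for contrast:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, explode_outer}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-demo").getOrCreate()
import spark.implicits._

// Hypothetical two-row sample: one parent with children, one child with none.
val demo = Seq(
    ("p1", Seq("c1", "c2")),
    ("c1", Seq.empty[String]))
  .toDF("Col1", "col3")

// explode: one row per array element; the row with the empty array disappears.
val exploded = demo.select($"Col1", explode($"col3") as "child")

// explode_outer: same, but the empty-array row is kept with a null child.
val explodedOuter = demo.select($"Col1", explode_outer($"col3") as "child")
```

That dropping behavior is harmless here, since the child rows are added back later from active_children.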
and a single-column dataframe holding the active children, like this:
val child_filter = active_children.select('Col1 as "child")
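and used this child_filter dataframe to filter (via a join) the parents you are interested in, then used groupBy to aggregate the rows back into your output format: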
val parents = rels
  .join(child_filter, "child")
  .groupBy("Col1")
  .agg(first('Col2) as "Col2",
       collect_list('child) as "col3",
       first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
+----+------+--------+------+
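One caveat: collect_list makes no promise about element order after the join's shuffle, so on a larger dataset the list could come back as [c3, c2]. If a stable order matters, one option is to wrap the aggregate in sort_array. A minimal sketch, assuming a local SparkSession and small stand-in versions of rels and child_filter (the name parentsSorted is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, first, sort_array}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sorted-agg").getOrCreate()
import spark.implicits._

// Stand-ins for the rels and child_filter dataframes built above.
val rels = Seq(
    ("p1", "Parent", "Active", "c1"),
    ("p1", "Parent", "Active", "c2"),
    ("p1", "Parent", "Active", "c3"),
    ("p1", "Parent", "Active", "c4"))
  .toDF("Col1", "Col2", "flag", "child")
val child_filter = Seq("c2", "c3").toDF("child")

// Same aggregation as above, with the collected list sorted ascending.
val parentsSorted = rels
  .join(child_filter, "child")
  .groupBy("Col1")
  .agg(first($"Col2") as "Col2",
       sort_array(collect_list($"child")) as "col3",
       first($"flag") as "flag")
```

Note that sorting gives alphabetical order, which is not necessarily the order the children had in the original col3 list.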
Finally, a union produces the expected output:
scala> parents.union(active_children).show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
| c2| Child| []|Active|
| c3| Child| []|Active|
+----+------+--------+------+
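As a possible alternative to the explode/join/groupBy round trip, on Spark 2.4 or later the same filtering can be sketched with array_intersect: collect the active child ids into one array, attach it to every row, and intersect it with col3. This is a sketch under assumptions, not the method used above: the names activeIds and result are illustrative, a local SparkSession is assumed, and collecting all active ids into a single array only makes sense when that set is reasonably small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_intersect, collect_set}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("array-intersect").getOrCreate()
import spark.implicits._

val df = Seq(
    ("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
    ("c1", "Child", Seq.empty[String], "Inactive"),
    ("c2", "Child", Seq.empty[String], "Active"),
    ("c3", "Child", Seq.empty[String], "Active"))
  .toDF("Col1", "Col2", "col3", "flag")

// Single-row dataframe holding the ids of all active children.
val activeIds = df
  .where($"flag" === "Active" && $"Col2" === "Child")
  .agg(collect_set($"Col1") as "active_ids")

// Keep the active rows and trim each col3 list down to active children only.
val result = df
  .where($"flag" === "Active")
  .crossJoin(activeIds)
  .withColumn("col3", array_intersect($"col3", $"active_ids"))
  .drop("active_ids")
```

This differs from the approach above in two ways: an active parent with no active children is kept with an empty list instead of being dropped by the inner join, and inactive parents are filtered out.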