I have the following DataFrame:
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     null|   null|
| 34|    null|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
I want the following output:
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50|       2|     null|   null|
| 34|       4|     true|60000.0|
| 32|    null|    false|35000.0|
+---+--------+---------+-------+
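As you can see, the age column contains the value 34 twice, so I want to merge the two rows with age 34 into one, keeping each column's non-null value rather than the null from the other row.
Thanks.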
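Answer 0 (score: 0)
If the requirement is to take the first non-null value in each group, you can use the `first` function with `ignoreNulls = true`: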
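import org.apache.spark.sql.functions.first
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Sample data; Option models the nullable columns.
val df = Seq(
  (50, Some(2), None, None),
  (34, Some(4), None, None),
  (34, None, Some(true), Some(60000.0)),
  (32, None, Some(false), Some(35000.0))
).toDF("age", "children", "education", "income")

// For each age, keep the first non-null value of every other column.
val result = df
  .groupBy("age")
  .agg(
    first("children", ignoreNulls = true).alias("children"),
    first("education", ignoreNulls = true).alias("education"),
    first("income", ignoreNulls = true).alias("income")
  )

result.orderBy("age").show(false)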
Output:
+---+--------+---------+-------+
|age|children|education|income |
+---+--------+---------+-------+
|32 |null |false |35000.0|
|34 |4 |true |60000.0|
|50 |2 |null |null |
+---+--------+---------+-------+
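
If the DataFrame has more columns than you want to list by hand, the same aggregation can be built from the schema. The following is only a sketch under a few assumptions: a local SparkSession is created for the demo, and the object `MergeByKey` and helper `mergeByKey` are hypothetical names that are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, first}

object MergeByKey {
  // Hypothetical helper: for each value of `key`, keep the first non-null
  // value of every other column. Assumes at least one non-key column exists.
  def mergeByKey(df: DataFrame, key: String): DataFrame = {
    val aggs = df.columns
      .filter(_ != key)
      .map(c => first(col(c), ignoreNulls = true).alias(c))
    df.groupBy(key).agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-by-key")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, Option[Int], Option[Boolean], Option[Double])](
      (50, Some(2), None, None),
      (34, Some(4), None, None),
      (34, None, Some(true), Some(60000.0)),
      (32, None, Some(false), Some(35000.0))
    ).toDF("age", "children", "education", "income")

    mergeByKey(df, "age").orderBy("age").show(false)
    spark.stop()
  }
}
```

Keep in mind that `first` depends on the order of rows within each group, which Spark does not guarantee after a shuffle. When every column has at most one non-null value per key, as in the data above, the result does not depend on that order.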