I have the following DataFrame in Spark:
nodeFrom  nodeTo  value  date
1         2       11     2016-10-12T12:10:00.000Z
1         2       12     2016-10-12T12:11:00.000Z
1         2       11     2016-10-12T12:09:00.000Z
4         2       34     2016-10-12T14:00:00.000Z
4         2       34     2016-10-12T14:00:00.000Z
5         3       11     2016-10-12T14:00:00.000Z
I need to drop duplicate (nodeFrom, nodeTo) pairs, while at the same time keeping the earliest and latest date and the average of the corresponding value entries.
The expected output is as follows:
nodeFrom  nodeTo  value  date
1         2       11.5   [2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]
4         2       34     [2016-10-12T14:00:00.000Z]
5         3       11     [2016-10-12T14:00:00.000Z]
Answer 0 (score: 0)
Using the struct function together with min and max, you only need a single groupBy and agg step.
Assuming this is your data:
import spark.implicits._ // needed for .toDF outside the spark-shell

val data = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

data.show()
You can get the average value and an array of the earliest/latest dates as follows:
import org.apache.spark.sql.functions._

data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo,
    ($"date1.value" + $"date2.value") / 2.0d as 'value,
    array($"date1.date", $"date2.date") as 'date
  )
  .show(60, false)
This gives you almost what you want; the small difference is that every date array has size 2:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
+--------+------+-----+----------------------------------------------------+
If you really (really?) want to eliminate the duplicates inside the array column, the simplest way is to use a custom udf:
val elimDuplicates = udf((_: collection.mutable.WrappedArray[String]).distinct)

data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo,
    ($"date1.value" + $"date2.value") / 2.0d as 'value,
    elimDuplicates(array($"date1.date", $"date2.date")) as 'date
  )
  .show(60, false)
This produces:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
+--------+------+-----+----------------------------------------------------+
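As an aside (an assumption beyond this answer: it requires Spark 2.4 or later), the built-in array_distinct function can replace the custom udf entirely. The udf itself just delegates to Scala's distinct, as this plain-collection sketch shows:

```scala
// The elimDuplicates udf simply applies Scala's `distinct` to the array;
// on an ordinary collection the behaviour is identical.
val dates = Seq("2016-10-12T14:00:00.000Z", "2016-10-12T14:00:00.000Z")
val deduped = dates.distinct
// deduped == Seq("2016-10-12T14:00:00.000Z")
```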
A brief explanation: min(struct('date, 'value)) as 'date1 selects the earliest date together with its corresponding value, and max(struct('date, 'value)) as 'date2 does the same for the latest date; the two value fields are then averaged and the two date fields collected into an array.
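The reason this works is that struct values compare field by field from the left, so ordering by struct('date, 'value) orders primarily by date, and ISO-8601 timestamps sort lexicographically in chronological order. The same effect can be seen with plain Scala tuples (an analogy, not Spark's actual code path):

```scala
// Tuples, like Spark structs, compare element by element from the left,
// so (date, value) pairs order primarily by date.
val rows = Seq(
  ("2016-10-12T12:10:00.000Z", 11),
  ("2016-10-12T12:11:00.000Z", 12),
  ("2016-10-12T12:09:00.000Z", 11)
)
val earliest = rows.min // the row with the smallest (i.e. earliest) date
val latest   = rows.max // the row with the largest (i.e. latest) date
```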
Hope that helps.
Answer 1 (score: 0)
You can do an ordinary groupBy and then use a udf to build the date column as needed, like this:
val df = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

def zipDates = udf((date1: String, date2: String) => {
  if (date1 == date2)
    Seq(date1)
  else
    Seq(date1, date2)
})

val dfT = df
  .groupBy('nodeFrom, 'nodeTo)
  .agg(avg('value) as "value", min('date) as "minDate", max('date) as "maxDate")
  .select('nodeFrom, 'nodeTo, 'value, zipDates('minDate, 'maxDate) as "date")

dfT.show(10, false)
// +--------+------+------------------+----------------------------------------------------+
// |nodeFrom|nodeTo|value |date |
// +--------+------+------------------+----------------------------------------------------+
// |1 |2 |11.333333333333334|[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
// |5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
// |4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
// +--------+------+------------------+----------------------------------------------------+
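Note that this differs from the expected output for group (1, 2): avg('value) averages all three rows (11.333…), while Answer 0 averages only the values attached to the earliest- and latest-dated rows (11.5). The difference in plain Scala:

```scala
// Group (1, 2) has values 11, 12, 11.
val values = Seq(11, 12, 11)
val avgAll = values.sum / values.size.toDouble // this answer: averages all rows

// Answer 0 averages only the values of the earliest- and latest-dated rows,
// which for this group are 11 and 12.
val valuesAtMinMaxDate = Seq(11, 12)
val avgMinMax = valuesAtMinMaxDate.sum / 2.0
```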