How do I remove duplicate node pairs in Spark?

Asked: 2018-03-12 23:24:43

Tags: scala apache-spark spark-dataframe

I have the following DataFrame in Spark:

nodeFrom    nodeTo    value    date
1           2         11       2016-10-12T12:10:00.000Z
1           2         12       2016-10-12T12:11:00.000Z
1           2         11       2016-10-12T12:09:00.000Z
4           2         34       2016-10-12T14:00:00.000Z
4           2         34       2016-10-12T14:00:00.000Z
5           3         11       2016-10-12T14:00:00.000Z

I need to remove duplicate nodeFrom/nodeTo pairs, keeping the earliest and latest date and the average of the corresponding value entries.

The expected output is:

nodeFrom    nodeTo    value    date
1           2         11.5     [2016-10-12T12:09:00.000Z,2016-10-12T12:11:00.000Z]
4           2         34       [2016-10-12T14:00:00.000Z]
5           3         11       [2016-10-12T14:00:00.000Z]

2 answers:

Answer 0 (score: 0):

Using the struct function together with min and max, you need only a single groupBy + agg step.

Assuming this is your data:

val data = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

data.show()

You can get the average value and the array of earliest/latest dates as follows:

import org.apache.spark.sql.functions._
data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo, 
    ($"date1.value" + $"date2.value") / 2.0d as 'value, 
    array($"date1.date", $"date2.date") as 'date
  )
  .show(60, false)

This gives you almost what you want; the small difference is that every date array has size 2:

+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date                                                |
+--------+------+-----+----------------------------------------------------+
|1       |2     |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5       |3     |11.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
|4       |2     |34.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
+--------+------+-----+----------------------------------------------------+
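The reason min(struct('date, 'value)) picks the earliest date is that Spark compares structs field by field, the same way Scala compares tuples lexicographically. A minimal plain-Scala sketch of that ordering (no Spark needed; it relies on ISO-8601 timestamps sorting chronologically as plain strings):

```scala
// Rows of group (nodeFrom=1, nodeTo=2) as (date, value) pairs.
// ISO-8601 timestamps sort chronologically when compared as strings.
val rows = Seq(
  ("2016-10-12T12:10:00.000Z", 11),
  ("2016-10-12T12:11:00.000Z", 12),
  ("2016-10-12T12:09:00.000Z", 11)
)

// Tuples compare field by field, so min/max pick the earliest/latest
// date and carry the matching value along -- just like min/max over
// struct('date, 'value).
val earliest = rows.min  // ("2016-10-12T12:09:00.000Z", 11)
val latest   = rows.max  // ("2016-10-12T12:11:00.000Z", 12)

println(earliest)
println(latest)
```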

If you really (really?) want to eliminate duplicates inside the array column, the easiest way is a custom udf (on Spark 2.4+ the built-in array_distinct function would do the same):

val elimDuplicates = udf((_: collection.mutable.WrappedArray[String]).distinct)
data
  .groupBy('nodeFrom, 'nodeTo).agg(
    min(struct('date, 'value)) as 'date1,
    max(struct('date, 'value)) as 'date2
  )
  .select(
    'nodeFrom, 'nodeTo, 
    ($"date1.value" + $"date2.value") / 2.0d as 'value, 
    elimDuplicates(array($"date1.date", $"date2.date")) as 'date
  )
  .show(60, false)    

This produces:

+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date                                                |
+--------+------+-----+----------------------------------------------------+
|1       |2     |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5       |3     |11.0 |[2016-10-12T14:00:00.000Z]                          |
|4       |2     |34.0 |[2016-10-12T14:00:00.000Z]                          |
+--------+------+-----+----------------------------------------------------+

Brief explanation:

  • min(struct('date, 'value)) as 'date1 selects the earliest date together with its corresponding value
  • max(struct('date, 'value)) as 'date2 does the same for the latest date
  • the average is computed directly from these two tuples, by summing the two values and dividing by 2
  • the two dates are written into an array column
  • (optionally) the array is deduplicated
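Put together, the per-group computation that the agg/select pipeline performs can be sketched in plain Scala. The helper summarize below is hypothetical (no Spark; it assumes string dates in ISO-8601 form, so tuple min/max match the struct min/max above):

```scala
// Hypothetical helper mirroring the groupBy/agg/select pipeline for one
// group: min/max over (date, value) tuples, average the two boundary
// values, collect the two dates, and deduplicate the date array.
def summarize(rows: Seq[(String, Int)]): (Double, Seq[String]) = {
  val (minDate, minVal) = rows.min   // earliest date + its value
  val (maxDate, maxVal) = rows.max   // latest date + its value
  val avg = (minVal + maxVal) / 2.0  // average of the two boundary values
  (avg, Seq(minDate, maxDate).distinct)
}

val group12 = Seq(
  ("2016-10-12T12:10:00.000Z", 11),
  ("2016-10-12T12:11:00.000Z", 12),
  ("2016-10-12T12:09:00.000Z", 11)
)
println(summarize(group12))
// (11.5,List(2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z))
```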

Hope that helps.

Answer 1 (score: 0):

You can do a plain groupBy and then use a udf to build the date column as needed, like this:

val df = Seq(
  (1, 2, 11, "2016-10-12T12:10:00.000Z"),
  (1, 2, 12, "2016-10-12T12:11:00.000Z"),
  (1, 2, 11, "2016-10-12T12:09:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (4, 2, 34, "2016-10-12T14:00:00.000Z"),
  (5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")

def zipDates = udf((date1: String, date2: String) => {
    if (date1 == date2)
        Seq(date1)
    else
        Seq(date1, date2)
})

val dfT = df
    .groupBy('nodeFrom, 'nodeTo)
    .agg(avg('value) as "value", min('date) as "minDate", max('date) as "maxDate")
    .select('nodeFrom, 'nodeTo, 'value, zipDates('minDate, 'maxDate) as "date")

dfT.show(10, false)
// +--------+------+------------------+----------------------------------------------------+
// |nodeFrom|nodeTo|value             |date                                                |
// +--------+------+------------------+----------------------------------------------------+
// |1       |2     |11.333333333333334|[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
// |5       |3     |11.0              |[2016-10-12T14:00:00.000Z]                          |
// |4       |2     |34.0              |[2016-10-12T14:00:00.000Z]                          |
// +--------+------+------------------+----------------------------------------------------+
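Note that the two answers disagree on the value column for group (1, 2): avg('value) here averages every row in the group (giving 11.333…), whereas Answer 0 averages only the values attached to the earliest and latest dates (giving 11.5, which matches the expected output). A quick plain-Scala sketch of the difference:

```scala
// Values in group (nodeFrom=1, nodeTo=2), paired with their dates.
val rows = Seq(
  ("2016-10-12T12:10:00.000Z", 11),
  ("2016-10-12T12:11:00.000Z", 12),
  ("2016-10-12T12:09:00.000Z", 11)
)

// Answer 1: avg('value) averages every row in the group.
val avgAll = rows.map(_._2).sum / rows.size.toDouble  // 11.333...

// Answer 0: only the values at the earliest and latest date are averaged.
val avgEnds = (rows.min._2 + rows.max._2) / 2.0       // 11.5

println(avgAll)
println(avgEnds)
```

Which one is correct depends on whether "the average of the corresponding values" means all values in the group or just the two boundary ones.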