Apply a function to a column based on the value of another column

Asked: 2019-07-11 08:13:53

Tags: scala apache-spark

I want to apply a function to a column in a DataFrame. Which function to apply depends on the value of one of the DataFrame's columns, and the value-to-function mapping is available as a Map.

Source DataFrame:

TAG       Timestamp              Value
TAG1    2019-06-21 01:16:00.0   621.0947
TAG1    2019-06-21 01:16:00.0   621.0947
TAG1    2019-06-21 01:16:00.0   621.0947
TAG1    2019-06-21 01:16:00.0   619.9578
TAG2    2019-06-21 01:29:00.0   767.5475
TAG2    2019-06-21 01:29:00.0   768.9506
TAG2    2019-06-21 01:29:00.0   770.8863
TAG3    2019-06-21 01:16:00.0   203.7457

Map:

Map(Tag1 -> avg, Tag2 -> max, Tag3 -> min)

Expected output:

TAG     Timestamp               Value
TAG1    2019-06-21 01:16:00.0   620.810475  (avg applied for Tag1 values)
TAG2    2019-06-21 01:29:00.0   770.8863    (max applied)
TAG3    2019-06-21 01:16:00.0   203.7457    (min applied)

I was able to get as far as aggregating all the values into a single column; where I am stuck is applying the functions dynamically.

Nothing is in a working state yet. My idea was to collect the values as a list per tag and then try to apply the function to that list.

val grouped = df.groupBy("TAG").agg(collect_list("Value") as "Value")

Output (note this groups only by TAG, so Timestamp is dropped):

TAG     Value
TAG1    [621.0947, 621.0947, 621.0947, 619.9578]
TAG2    [767.5475, 768.9506, 770.8863]
TAG3    [203.7457]
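Starting from that list-per-tag idea, the lookup itself can be sketched in plain Scala (no Spark needed) by having the map hold actual aggregation functions rather than their names. The names `rows`, `funcMap`, and `result` below are illustrative, not from the original post, and the map keys are normalized to match the TAG column:

```scala
// Illustrative plain-Scala sketch: group values by tag, then apply
// whichever aggregation function the map assigns to that tag.
val rows = Seq(
  ("TAG1", 621.0947), ("TAG1", 621.0947), ("TAG1", 621.0947), ("TAG1", 619.9578),
  ("TAG2", 767.5475), ("TAG2", 768.9506), ("TAG2", 770.8863),
  ("TAG3", 203.7457))

// Map each tag to a real aggregation function instead of a string name.
val funcMap: Map[String, Seq[Double] => Double] = Map(
  "TAG1" -> (vs => vs.sum / vs.size), // avg
  "TAG2" -> (vs => vs.max),           // max
  "TAG3" -> (vs => vs.min))           // min

// Group by tag, then dispatch on the tag to pick the aggregation.
val result: Map[String, Double] =
  rows.groupBy(_._1).map { case (tag, grouped) =>
    tag -> funcMap(tag)(grouped.map(_._2))
  }
```

The same dispatch-through-a-map shape carries over to Spark once the per-tag values are collected, but Spark can also do it without materializing the lists, as the answer below shows.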

1 Answer:

Answer 0 (score: 1)

You can use when...otherwise like a case statement:

import org.apache.spark.sql.functions._ // when, col, avg, max, min
import spark.implicits._
val df = Seq(
  ("TAG1", "2019-06-21 01:16:00.0",621.0947),
  ("TAG1", "2019-06-21 01:16:00.0",621.0947),
  ("TAG1", "2019-06-21 01:16:00.0",621.0947),
  ("TAG1", "2019-06-21 01:16:00.0",619.9578),
  ("TAG2", "2019-06-21 01:29:00.0",767.5475),
  ("TAG2", "2019-06-21 01:29:00.0",768.9506),
  ("TAG2", "2019-06-21 01:29:00.0",770.8863),
  ("TAG3", "2019-06-21 01:16:00.0",203.7457)).toDF("TAG","Timestamp","Value")

df.groupBy(
  "TAG","Timestamp"
).agg(
  when(
    col("TAG") === "TAG1", avg("Value")
  ).otherwise(
    when(col("TAG") === "TAG2", max("Value")).otherwise(min("Value"))
  ).as("Value")
).show()
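The chain above hard-codes each tag, though, while the question wants it driven by the Map. One possible sketch of that (an assumption, not the answerer's code) folds the map into a single nested when/otherwise aggregate column, so a new tag only needs a new map entry. The map keys are normalized here to match the TAG column (the question's map used Tag1/Tag2/Tag3), and a local SparkSession is created to make the snippet self-contained:

```scala
// Sketch: build the when(...).otherwise(...) aggregate dynamically
// from the tag-to-function-name map, assuming Spark is on the classpath.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("dyn-agg").getOrCreate()
import spark.implicits._

val df = Seq(
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 619.9578),
  ("TAG2", "2019-06-21 01:29:00.0", 767.5475),
  ("TAG2", "2019-06-21 01:29:00.0", 768.9506),
  ("TAG2", "2019-06-21 01:29:00.0", 770.8863),
  ("TAG3", "2019-06-21 01:16:00.0", 203.7457)).toDF("TAG", "Timestamp", "Value")

// Keys normalized to match the TAG column.
val funcMap = Map("TAG1" -> "avg", "TAG2" -> "max", "TAG3" -> "min")

// Each map entry contributes "when TAG == <tag> then <fn>(Value)";
// the fold nests them into one column, with null as the fallback.
val aggCol = funcMap.foldLeft(lit(null).cast("double")) {
  case (acc, (tag, fn)) =>
    when(col("TAG") === tag, expr(s"$fn(Value)")).otherwise(acc)
}

val out = df.groupBy("TAG", "Timestamp").agg(aggCol.as("Value"))
out.show()
```

Tags absent from the map fall through to null here; swapping the initial fold value for, say, `expr("avg(Value)")` would give them a default aggregation instead.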