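I want to apply a function to a column in a DataFrame. Which function to apply depends on the value of one of the DataFrame's columns. The mapping from value to function is available as a Map.
Source DF: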
TAG Timestamp Value
TAG1 2019-06-21 01:16:00.0 621.0947
TAG1 2019-06-21 01:16:00.0 621.0947
TAG1 2019-06-21 01:16:00.0 621.0947
TAG1 2019-06-21 01:16:00.0 619.9578
TAG2 2019-06-21 01:29:00.0 767.5475
TAG2 2019-06-21 01:29:00.0 768.9506
TAG2 2019-06-21 01:29:00.0 770.8863
TAG3 2019-06-21 01:16:00.0 203.7457
Map:
Map(Tag1 -> avg, Tag2 -> max, Tag3 -> min)
Expected output:
TAG Timestamp Value
TAG1 2019-06-21 01:16:00.0 620.810475 (avg applied for Tag1 values)
TAG2 2019-06-21 01:29:00.0 770.8863 (max applied)
TAG3 2019-06-21 01:16:00.0 203.7457 (min applied)
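I got as far as aggregating all the values into a single column; where I am stuck is applying the function dynamically.
Nothing I have tried works so far. What I think could work is to collect the values into a list in one column and then apply the function to that list.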
val grouped = df.groupBy("TAG").agg(collect_list("value") as "value")
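For reference, the list-based idea could be completed roughly as follows. This is only a sketch under some assumptions: Spark 2.4 or later run in local mode, a hypothetical lookup `fnByTag` whose keys match the TAG values exactly (the map above writes `Tag1`, the data uses `TAG1`), and grouping by both `TAG` and `Timestamp` so the timestamp survives into the result. The names `ListThenApply`, `fnByTag`, `applyByTag` and `aggregate` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, udf}

object ListThenApply {
  // Hypothetical lookup: tag -> plain Scala function over the collected values.
  val fnByTag: Map[String, Seq[Double] => Double] = Map(
    "TAG1" -> ((vs: Seq[Double]) => vs.sum / vs.size), // average
    "TAG2" -> ((vs: Seq[Double]) => vs.max),
    "TAG3" -> ((vs: Seq[Double]) => vs.min)
  )

  // UDF that picks the function by tag; tags missing from the map produce null.
  val applyByTag = udf((tag: String, values: Seq[Double]) =>
    fnByTag.get(tag).map(f => f(values)))

  // Collect the values per group into a list, then apply the tag's function to it.
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy("TAG", "Timestamp")
      .agg(collect_list("Value").as("values"))
      .select(col("TAG"), col("Timestamp"),
        applyByTag(col("TAG"), col("values")).as("Value"))
      .orderBy("TAG")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("list-then-apply").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 619.9578),
      ("TAG2", "2019-06-21 01:29:00.0", 767.5475),
      ("TAG2", "2019-06-21 01:29:00.0", 768.9506),
      ("TAG2", "2019-06-21 01:29:00.0", 770.8863),
      ("TAG3", "2019-06-21 01:16:00.0", 203.7457)
    ).toDF("TAG", "Timestamp", "Value")

    aggregate(df).show(false)
    spark.stop()
  }
}
```

A UDF over a collected list pulls every value of a group into memory and bypasses Spark's built-in aggregates, so it is probably only worth it when the per-tag function has no built-in equivalent.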
Answer 0 (score: 1)
You can use when...otherwise, which works like a CASE statement:
import org.apache.spark.sql.functions.{avg, col, max, min, when}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
  ("TAG1", "2019-06-21 01:16:00.0", 619.9578),
  ("TAG2", "2019-06-21 01:29:00.0", 767.5475),
  ("TAG2", "2019-06-21 01:29:00.0", 768.9506),
  ("TAG2", "2019-06-21 01:29:00.0", 770.8863),
  ("TAG3", "2019-06-21 01:16:00.0", 203.7457)
).toDF("TAG", "Timestamp", "Value")

// Group by tag and timestamp, then pick the aggregate based on the tag.
df.groupBy("TAG", "Timestamp").agg(
  when(col("TAG") === "TAG1", avg("Value"))
    .when(col("TAG") === "TAG2", max("Value"))
    .otherwise(min("Value")) // any other tag, including TAG3, falls through to min
    .as("Value")
).show()
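Since the question has the tag-to-function mapping available as a Map, the when chain does not have to be written out by hand. Below is a minimal sketch, assuming Spark 2.4 or later in local mode, that folds a map into one when/otherwise expression. The names `DynamicAggByTag`, `aggByTag`, `aggExpr` and `aggregate` are hypothetical, and the map keys are assumed to match the TAG values exactly.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, lit, max, min, when}

object DynamicAggByTag {
  // Assumed lookup: tag -> built-in aggregate. Keys are matched case-sensitively.
  val aggByTag: Map[String, Column => Column] = Map(
    "TAG1" -> ((c: Column) => avg(c)),
    "TAG2" -> ((c: Column) => max(c)),
    "TAG3" -> ((c: Column) => min(c))
  )

  // Fold the map into a single when/otherwise chain.
  // Tags that are missing from the map end up with a null Value.
  def aggExpr(tagCol: Column, valueCol: Column): Column =
    aggByTag.foldLeft(lit(null).cast("double")) { case (acc, (tag, f)) =>
      when(tagCol === tag, f(valueCol)).otherwise(acc)
    }

  def aggregate(df: DataFrame): DataFrame =
    df.groupBy("TAG", "Timestamp")
      .agg(aggExpr(col("TAG"), col("Value")).as("Value"))
      .orderBy("TAG")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("dynamic-agg").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 621.0947),
      ("TAG1", "2019-06-21 01:16:00.0", 619.9578),
      ("TAG2", "2019-06-21 01:29:00.0", 767.5475),
      ("TAG2", "2019-06-21 01:29:00.0", 768.9506),
      ("TAG2", "2019-06-21 01:29:00.0", 770.8863),
      ("TAG3", "2019-06-21 01:16:00.0", 203.7457)
    ).toDF("TAG", "Timestamp", "Value")

    aggregate(df).show(false)
    spark.stop()
  }
}
```

Because the tag is a grouping column, each group evaluates exactly one branch, so adding a new tag only means adding an entry to the map.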