import org.apache.spark.sql.functions.{min, max}
import spark.implicits._  // needed for the 'symbol column syntax

val result = df
  .groupBy("col1")
  .agg(
    min('minTimestamp) as "StartDateUTC",
    max('maxTimestamp) as "EndDateUTC")
For each col1, I need to find the minimum and maximum timestamps. The problem is that in some cases StartDateUTC ends up greater than EndDateUTC (see the rows for A in df below). Is there an efficient way to swap the values in those cases?
df =

col1  minTimestamp  maxTimestamp
A     1483264800    1483164800
A     1483200000    1483064800
B     1483300000    1483564800
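For reference, a minimal self-contained sketch that reproduces this DataFrame (the SparkSession setup and the Long epoch-second column types are assumptions, not part of the original post):

import org.apache.spark.sql.SparkSession

// Hypothetical local session for testing; adapt to your environment
val spark = SparkSession.builder().master("local[*]").appName("swap-demo").getOrCreate()
import spark.implicits._

// Sample rows copied from the question; timestamps assumed to be epoch seconds
val df = Seq(
  ("A", 1483264800L, 1483164800L),
  ("A", 1483200000L, 1483064800L),
  ("B", 1483300000L, 1483564800L)
).toDF("col1", "minTimestamp", "maxTimestamp")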
Answer (score: 3)
Use least / greatest to normalize the two columns:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax

df.select(
  $"col1",
  least($"minTimestamp", $"maxTimestamp").alias("minTimestamp"),
  greatest($"minTimestamp", $"maxTimestamp").alias("maxTimestamp")
)
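With the sample df above, this select should produce (output written by hand from the data shown; show() formatting approximate):

+----+------------+------------+
|col1|minTimestamp|maxTimestamp|
+----+------------+------------+
|   A|  1483164800|  1483264800|
|   A|  1483064800|  1483200000|
|   B|  1483300000|  1483564800|
+----+------------+------------+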
Or push it into the aggregation:
val result = df
  .groupBy("col1")
  .agg(
    min(least($"minTimestamp", $"maxTimestamp")) as "StartDateUTC",
    max(greatest($"minTimestamp", $"maxTimestamp")) as "EndDateUTC")
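With the sample df above, this should yield one normalized row per key (again, output sketched by hand from the data shown, so formatting may differ):

result.orderBy("col1").show()
// +----+------------+----------+
// |col1|StartDateUTC|EndDateUTC|
// +----+------------+----------+
// |   A|  1483064800|1483264800|
// |   B|  1483300000|1483564800|
// +----+------------+----------+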