Aggregating JSON objects in a DataFrame and converting string timestamps to dates

Time: 2017-06-19 19:12:11

Tags: scala apache-spark

The JSON lines I receive look like this:

    [{"time":"2017-03-23T12:20:00","user":"randomUser","action":"sleeping","count":2}]
    [{"time":"2017-03-23T12:30:00","user":"randomUser","action":"sleeping","count":1}]
    [{"time":"2017-03-23T15:30:00","user":"randomUser2","action":"eating","count":2}]

So I have two problems. First, the time is stored as a String in my df; I believe it has to be a Date (or Timestamp) for me to work with it?

Second, I need to aggregate this data into 5-minute intervals. For example, everything that happens from 2017-03-23T12:20:00 to 2017-03-23T12:24:59 should be aggregated and treated as the 2017-03-23T12:20:00 timestamp.

Expected output:

{{1}}

Thanks

1 answer:

Answer 0: (score: 0)

You can use casting to convert the StringType column into a TimestampType column; then you can cast the timestamp to IntegerType (epoch seconds), which makes "rounding" down to the most recent 5-minute interval easier, and group by that (and all the other columns):
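To make the rounding concrete, here is the arithmetic on one hypothetical timestamp (assuming a UTC session timezone for the String-to-Timestamp cast); the code below applies the same computation columnwise:

    // Worked example of the 5-minute bucketing arithmetic:
    val secs   = 1490271825L     // "2017-03-23T12:23:45" as epoch seconds (UTC)
    val offset = secs % (60 * 5) // 225 seconds past the last 5-minute mark
    val bucket = secs - offset   // 1490271600 == "2017-03-23T12:20:00"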

    // importing SparkSession's implicits and the SQL types used in the casts
    import spark.implicits._
    import org.apache.spark.sql.types.{IntegerType, TimestampType}

    // use casting to convert the String column into a Timestamp
    val withTime = df.withColumn("time", $"time" cast TimestampType)

    // cast to epoch seconds, subtract the remainder mod 300 to round down to
    // the most recent 5-minute mark, cast back, and group by all the columns
    val result = withTime.withColumn("time", $"time" cast IntegerType)
      .withColumn("time", ($"time" - ($"time" mod (60 * 5))) cast TimestampType)
      .groupBy("time", "user", "action").count()

    result.show(truncate = false)
    // +---------------------+-----------+--------+-----+
    // |time                 |user       |action  |count|
    // +---------------------+-----------+--------+-----+
    // |2017-03-23 12:20:00.0|randomUser |sleeping|2    |
    // |2017-03-23 15:30:00.0|randomUser2|eating  |2    |
    // |2017-03-23 12:30:00.0|randomUser |sleeping|1    |
    // +---------------------+-----------+--------+-----+
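As a side note, a sketch of an alternative using Spark's built-in window function (available since Spark 2.0), which buckets timestamps without the manual epoch arithmetic; it reuses the withTime DataFrame from above:

    import org.apache.spark.sql.functions.window

    // window() emits a struct column {start, end}; keeping only the start
    // reproduces "treat the whole interval as its opening timestamp".
    val viaWindow = withTime
      .groupBy(window($"time", "5 minutes"), $"user", $"action")
      .count()
      .withColumn("time", $"window.start")
      .drop("window")
      .select("time", "user", "action", "count")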