How to convert a unix timestamp to a given timezone with Spark

Asked: 2017-06-27 10:52:02

Tags: apache-spark apache-spark-sql timezone

I need some help, because I seem to be getting lost in timezones :)

I am using Spark 1.6.2

I have epochs like this:

+--------------+-------------------+-------------------+
|unix_timestamp|UTC                |Europe/Helsinki    |
+--------------+-------------------+-------------------+
|1491771599    |2017-04-09 20:59:59|2017-04-09 23:59:59|
|1491771600    |2017-04-09 21:00:00|2017-04-10 00:00:00|
|1491771601    |2017-04-09 21:00:01|2017-04-10 00:00:01|
+--------------+-------------------+-------------------+

The default timezone on the Spark machines is:

#timezone = DefaultTz: Europe/Prague, SparkUtilTz: Europe/Prague

which is the output of:
logger.info("#timezone = DefaultTz: {}, SparkUtilTz: {}", TimeZone.getDefault.getID, org.apache.spark.sql.catalyst.util.DateTimeUtils.defaultTimeZone.getID)

I want to count timestamps grouped by date and hour in a given timezone (here Europe/Helsinki, which is currently UTC+3; see the small sanity check after the expected output below).

What I expect:

+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|23       |1    |
|2017-04-10|0        |2    |
+----------+---------+-----+
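As a quick sanity check of that offset (my own sketch, not part of the original data): Europe/Helsinki observes EEST (UTC+3) in April 2017, so 2017-04-09 21:00:00 UTC should indeed be 2017-04-10 00:00:00 local time.

import java.util.TimeZone

val helsinki = TimeZone.getTimeZone("Europe/Helsinki")
// offset in milliseconds at epoch second 1491771600 (2017-04-09 21:00:00 UTC)
val offsetHours = helsinki.getOffset(1491771600L * 1000) / (60 * 60 * 1000)
println(offsetHours)  // prints 3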

The code (using from_utc_timestamp):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {

    import sqlContext.implicits._

    val onlyTime = inputDF.select(
         from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType),  timeZone).alias("time")
    )

    val visitsPerTime =
        if (aggr.equalsIgnoreCase("hourly")) {
            onlyTime.groupBy(
                date_format($"time", "yyyy-MM-dd").alias("date"),
                date_format($"time", "H").cast(DataTypes.IntegerType).alias("hour")
            ).count()
        } else if (aggr.equalsIgnoreCase("daily")) {
            onlyTime.groupBy(
                date_format($"time", "yyyy-MM-dd").alias("date")
            ).count()
        } else {
            // without a final else the expression would not type-check as a DataFrame
            throw new IllegalArgumentException(s"Unsupported aggregation: $aggr")
        }

    visitsPerTime.show(false)

    visitsPerTime
}
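I call the function like this (a minimal sketch of how the parameters are used; inputDF is assumed to contain the unix_timestamp column shown above):

val hourlyCounts = getCountsPerTime(sqlContext, inputDF, "Europe/Helsinki", "hourly")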

What I get :'(

+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|22       |1    |
|2017-04-09|23       |2    |
+----------+---------+-----+

I tried wrapping it with to_utc_timestamp:

// imports as above, plus:
import org.apache.spark.sql.catalyst.util.DateTimeUtils

def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {

    import sqlContext.implicits._

    val onlyTime = inputDF.select(
        to_utc_timestamp(from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), timeZone), DateTimeUtils.defaultTimeZone.getID).alias("time")
    )

    val visitsPerTime = ... //same as above

    visitsPerTime.show(false)

    visitsPerTime
}

What I get :(

+----------+---------+-----+
|tradedate |tradehour|count|
+----------+---------+-----+
|2017-04-09|20       |1    |
|2017-04-09|21       |2    |
+----------+---------+-----+

Do you know what the correct solution is?

Thanks in advance for your help

1 Answer:

Answer 0 (score: 1)

Your code didn't work for me, so I could not reproduce the last two outputs you got.

But I can give you some hints on how to achieve the output you expect.

I am assuming that you already have a dataframe like this:

+--------------+---------------------+---------------------+
|unix_timestamp|UTC                  |Europe/Helsinki      |
+--------------+---------------------+---------------------+
|1491750899    |2017-04-09 20:59:59.0|2017-04-09 23:59:59.0|
|1491750900    |2017-04-09 21:00:00.0|2017-04-10 00:00:00.0|
|1491750901    |2017-04-09 21:00:01.0|2017-04-10 00:00:01.0|
+--------------+---------------------+---------------------+

I got this dataframe using the following code:

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

val inputDF = Seq(
      "2017-04-09 20:59:59",
      "2017-04-09 21:00:00",
      "2017-04-09 21:00:01"
    ).toDF("unix_timestamp")

val onlyTime = inputDF.select(
      unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
      from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType),  "UTC").alias("UTC"),
      from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType),  "Europe/Helsinki").alias("Europe/Helsinki")
    )

onlyTime.show(false)

Once you have the dataframe, getting the output dataframe you want requires you to split the date, then groupBy and count, as below:

onlyTime.select(
        split($"Europe/Helsinki", " ")(0).as("date"),
        split(split($"Europe/Helsinki", " ")(1).as("time"), ":")(0).as("hour")
    )
    .groupBy("date", "hour")
    .agg(count("date").as("count"))
    .show(false)

The resulting dataframe:

+----------+----+-----+
|date      |hour|count|
+----------+----+-----+
|2017-04-09|23  |1    |
|2017-04-10|00  |2    |
+----------+----+-----+
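For comparison, here is a minimal alternative sketch (mine, assuming the same onlyTime dataframe built above) that avoids string splitting and works on the timestamp column directly with date_format and hour; note that hour() returns an integer, so midnight comes out as 0 rather than 00:

import org.apache.spark.sql.functions.{count, date_format, hour}

onlyTime
  .select(
    date_format($"Europe/Helsinki", "yyyy-MM-dd").alias("date"),
    hour($"Europe/Helsinki").alias("hour")
  )
  .groupBy("date", "hour")
  .agg(count("date").alias("count"))
  .show(false)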