How to filter a DataFrame to rows from the last 10 hours?

Asked: 2018-02-06 20:28:51

Tags: scala apache-spark dataframe

I have a DataFrame in Spark 2 (Scala) in which one column is a unix timestamp. I want to keep only the rows from the last 10 hours, counting back from now.

How can I do that? Here is my attempt:

val hours = 10
// This compares the epoch seconds against the literal 10, not against "10 hours ago":
val result = df.filter($"unix_timestamp" > hours)

1 Answer:

Answer 0 (score: 0)

With the following data:

import org.apache.spark.sql.functions.{current_timestamp, expr}
import spark.implicits._ // for toDF and the $"..." column syntax (already in scope in spark-shell)

val df = Seq(1517877887, 1517935463, 1517949824).toDF("unix_timestamp")
df.select($"unix_timestamp", $"unix_timestamp".cast("timestamp")).show
// +--------------+-------------------+
// |unix_timestamp|     unix_timestamp|
// +--------------+-------------------+
// |    1517877887|2018-02-06 00:44:47|
// |    1517935463|2018-02-06 16:44:23|
// |    1517949824|2018-02-06 20:43:44|
// +--------------+-------------------+
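
As a side note, from_unixtime gives a similar human-readable view; a minimal sketch assuming the same df (it yields a formatted string, not a timestamp type):

import org.apache.spark.sql.functions.from_unixtime

// Renders the epoch seconds as "yyyy-MM-dd HH:mm:ss" strings.
df.select($"unix_timestamp", from_unixtime($"unix_timestamp")).show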

spark.sql("SELECT current_timestamp").show
// +--------------------+
// | current_timestamp()|
// +--------------------+
// |2018-02-06 20:47:...|
// +--------------------+
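
The same check is available through the DataFrame API instead of SQL; a small sketch, assuming the usual spark session from spark-shell:

// A one-row DataFrame just to have something to select against.
spark.range(1).select(current_timestamp()).show(false)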

Subtract an INTERVAL expression:

val last10hours = df.where(
  $"unix_timestamp".cast("timestamp") > current_timestamp() - expr("INTERVAL 10 hours")
)

which gives you:

last10hours.select($"unix_timestamp", $"unix_timestamp".cast("timestamp")).show
// +--------------+-------------------+
// |unix_timestamp|     unix_timestamp|
// +--------------+-------------------+
// |    1517935463|2018-02-06 16:44:23|
// |    1517949824|2018-02-06 20:43:44|
// +--------------+-------------------+
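
If you prefer to stay in plain epoch seconds rather than casting, the same filter can be written with unix_timestamp(); a sketch, assuming the column really holds epoch seconds (hours and last10hoursEpoch are illustrative names):

import org.apache.spark.sql.functions.unix_timestamp

val hours = 10
// unix_timestamp() with no arguments is the current time in epoch seconds.
val last10hoursEpoch = df.where($"unix_timestamp" > unix_timestamp() - hours * 3600)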

See also: Adding 12 hours to datetime column in Spark
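
For the opposite direction covered in that reference, adding an interval follows the same pattern; a minimal sketch assuming the same df (plus_12_hours is just an illustrative alias):

// Shift each timestamp forward by 12 hours.
df.select(
  ($"unix_timestamp".cast("timestamp") + expr("INTERVAL 12 hours")).alias("plus_12_hours")
).show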