I have a DataFrame in Spark 2 (Scala) where one column holds unix timestamps. I want to keep only the rows from the last 10 hours, counted back from now.
How can I do this? This is what I have tried:
val hours = 10
val result = df.filter($"unix_timestamp" > hours)
Answer (score: 0)
Given the data:
import org.apache.spark.sql.functions.{current_timestamp, expr}
val df = Seq(1517877887, 1517935463, 1517949824).toDF("unix_timestamp")
df.select($"unix_timestamp", $"unix_timestamp".cast("timestamp")).show
// +--------------+-------------------+
// |unix_timestamp| unix_timestamp|
// +--------------+-------------------+
// | 1517877887|2018-02-06 00:44:47|
// | 1517935463|2018-02-06 16:44:23|
// | 1517949824|2018-02-06 20:43:44|
// +--------------+-------------------+
spark.sql("SELECT current_timestamp").show
// +--------------------+
// | current_timestamp()|
// +--------------------+
// |2018-02-06 20:47:...|
// +--------------------+
Subtract an INTERVAL expression:
val last10hours = df.where(
  $"unix_timestamp".cast("timestamp") > current_timestamp - expr("INTERVAL 10 hours")
)
which will give you:
last10hours.select($"unix_timestamp", $"unix_timestamp".cast("timestamp")).show
// +--------------+-------------------+
// |unix_timestamp| unix_timestamp|
// +--------------+-------------------+
// | 1517935463|2018-02-06 16:44:23|
// | 1517949824|2018-02-06 20:43:44|
// +--------------+-------------------+