Below is my DataFrame.
import spark.implicits._
val lastRunDtDF = sc.parallelize(Seq(
(1, 2,"2019-07-18 13:34:24")
)).toDF("id", "cnt","run_date")
lastRunDtDF.show
+---+---+-------------------+
| id|cnt| run_date|
+---+---+-------------------+
| 1| 2|2019-07-18 13:34:24|
+---+---+-------------------+
I want to create a new DataFrame by adding 2 minutes to the existing run_date column, with the new column named new_run_date. Sample output below.
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
I am trying something like the following:
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date")+"INTERVAL 2 MINUTE")
It doesn't look right. Thanks in advance for your help.
Answer 0 (score: 0)
Try wrapping INTERVAL 2 MINUTE in the expr function:
import org.apache.spark.sql.functions.expr
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date") + expr("INTERVAL 2 MINUTE"))
.show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
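Because expr builds the interval from a literal SQL string, the offset can also be assembled at runtime with string interpolation. A minimal sketch, assuming the number of minutes is known when the query is built; minutesToAdd is a hypothetical variable introduced here only for illustration:
import org.apache.spark.sql.functions.expr

val minutesToAdd = 2  // hypothetical: number of minutes to shift run_date by
lastRunDtDF.withColumn("new_run_date",
    lastRunDtDF("run_date") + expr(s"INTERVAL $minutesToAdd MINUTES"))
  .show()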
(or)
Using the from_unixtime and unix_timestamp functions:
import org.apache.spark.sql.functions._
lastRunDtDF.selectExpr("*",
    "from_unixtime(unix_timestamp(run_date) + 2*60, 'yyyy-MM-dd HH:mm:ss') as new_run_date")
.show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
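Note that from_unixtime returns a string column and unix_timestamp works at second granularity. If a timestamp-typed result is preferred, here is a minimal sketch (my own assumption, not part of the answer above) using the equivalent DataFrame API calls plus a cast:
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

lastRunDtDF.withColumn("new_run_date",
    // shift by 120 seconds, format back to a string, then cast to timestamp
    from_unixtime(unix_timestamp(col("run_date")) + 2 * 60, "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
  .show()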