I have the following dataframe:
+-------+-------+------+-----------+--------------+--------------------+------------+-------+
|     id|groupid| field|  oldstring|     newstring|             created|        pkey|project|
+-------+-------+------+-----------+--------------+--------------------+------------+-------+
|1451923| 594128| Team1|   [RLA N1]|    [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|  10120|
|1451958| 594140| Team1|   [SEP N2]|      [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|  10120|
|1452282| 594308| Team1| [N1-UO-SE]|      [SEP N2]|2013-03-29 14:09:...|DSTECH-57900|  10120|
|1492252| 610736| Team1| [N1-UO-SE]|      [SEP N2]|2013-04-17 08:48:...|DSTECH-59560|  10120|
|5105082|2304145| Team1|    [Aucun]| [SEP-SUPPORT]|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145| Team2|       null|   SEP-SUPPORT|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145| Team1|   [ISR N2]|   SEP-SUPPORT|2013-03-29 13:31:...|DSTECH-57895|  10120|
|1451926| 594129| Team1| [N1-UO-SE]|      [ISR N2]|2013-03-29 13:55:...|DSTECH-57895|  10120|
|1452182| 594273| Team1| [N1-UO-SE]|   [SEPN1-ENV]|2013-03-29 13:43:...|DSTECH-57895|  10120|
+-------+-------+------+-----------+--------------+--------------------+------------+-------+
I want to compute the processing date/time for each [pkey]. For example, given these two rows:
+-------+-------+------+----------+-----------+--------------------+------------+
|     id|groupid| field| oldstring|  newstring|             created|        pkey|
+-------+-------+------+----------+-----------+--------------------+------------+
|1451923| 594128| Team1|  [RLA N1]| [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|
|1451958| 594140| Team1|  [SEP N2]|   [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|
+-------+-------+------+----------+-----------+--------------------+------------+
[DSTECH-55770] = [2013-03-29 13:34:...] - [2013-03-29 13:31:...]
How can I compute this difference using the previous row's date? I found that I could do it with a user-defined aggregate function (UDAF). However, I also want the difference between the two dates displayed as a single duration, for example 8h:30min; to be clear, by 8h I don't mean 8 o'clock on the clock, but an elapsed time of 8 hours.
Could someone help me do this with a UDAF, or suggest another solution? Thanks.
Answer 0 (score: 1)
This looks like a case for SQL window functions. You can find more details here.
I suspect the resulting code might look something like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sparkSession = ... // create your SparkSession as usual
import sparkSession.implicits._

// Within the same project, order rows by the `created` column
val partitionWindow = Window.partitionBy("project").orderBy($"created".asc)

// Value of the `created` column taken from the next row in the window
// (null when the current row is the last one in its partition)
val createdTimeNextRowSameProject = lead($"created",
  1 // 1 = next row, 2 = two rows after, and so on
).over(partitionWindow)

// Difference in seconds between the next row's `created` and this row's;
// fall back to the current timestamp when there is no next row
val dfWithTimeDiffInSeconds = df.withColumn("datediff",
  unix_timestamp(coalesce(createdTimeNextRowSameProject, current_timestamp())) - unix_timestamp($"created"))
dfWithTimeDiffInSeconds.show(10)
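Since the question also asks for the difference shown as hours and minutes (e.g. 8h:30min, meaning a duration of 8 hours, not a clock time), here is a minimal sketch of that formatting step. It assumes the `datediff` column above holds a number of seconds; `formatDuration` is a hypothetical helper name, not part of Spark:

```scala
// Hypothetical helper: turn a duration in seconds into an "Xh:Ymin"
// string, where X is the total number of elapsed hours (a duration,
// not a time of day).
def formatDuration(seconds: Long): String = {
  val hours   = seconds / 3600
  val minutes = (seconds % 3600) / 60
  s"${hours}h:${minutes}min"
}

// To apply it to the DataFrame, wrap it as a UDF:
//   val formatDurationUdf = org.apache.spark.sql.functions.udf(formatDuration _)
//   dfWithTimeDiffInSeconds.withColumn("duration", formatDurationUdf($"datediff"))

println(formatDuration(30600)) // a 30600-second gap is 8h:30min
```

A plain function plus `udf` is enough here; no UDAF is needed, because the window function already pairs each row with the next one, so the remaining work is row-by-row formatting rather than aggregation.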