Question

I have two columns in a DataFrame:

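import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.{TimestampType, ArrayType}

// Pattern letters are case-sensitive: yyyy = year, MM = month, dd = day, HH = hour, mm = minute, ss = second
statusWithOutDuplication.withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss").cast("Timestamp"))
statusWithOutDuplication.withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss").cast("Timestamp"))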

I want to pass requestTime and responseTime into the UDF below and find the difference between them after setting the minutes and seconds to 0.

val split_hour_range_udf = udf { (startDateTime: TimestampType ,
                                      endDateTime: TimestampType ) =>

      }

In Python we would use replace (startDateTime.replace(second=0, minute=0)). What is the equivalent in Scala?

Answer 1

You can create a UDF as shown below, passing the values in as strings and converting them to Timestamp inside the UDF:

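val timeDiff = udf((start: String, end: String) => {
  // Convert to Timestamp and find the difference (here in seconds);
  // assumes non-null strings in "yyyy-MM-dd HH:mm:ss" form
  (java.sql.Timestamp.valueOf(end).getTime - java.sql.Timestamp.valueOf(start).getTime) / 1000
})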

and use it as:

df.withColumn("responseTime", timeDiff($"requestTime", $"responseTime"))

Instead of using a UDF, you can also use built-in Spark functions such as datediff.
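A rough sketch of the built-in route, for illustration only. Note that datediff counts whole days, so for an hour-level difference this sketch swaps in date_trunc and unix_timestamp instead. It assumes Spark 2.3 or later (for date_trunc) and string columns in "yyyy-MM-dd HH:mm:ss" form; the object name BuiltInHourDiff, the helper withHourDiff and the output column diffSeconds are hypothetical names, not part of the original question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_trunc, to_timestamp, unix_timestamp}

object BuiltInHourDiff {
  // Assumed input: string columns "requestTime" and "responseTime"
  // formatted as "yyyy-MM-dd HH:mm:ss".
  def withHourDiff(df: DataFrame): DataFrame = {
    val fmt = "yyyy-MM-dd HH:mm:ss"
    // date_trunc("hour", ...) sets minutes and seconds to 0
    val requestHour = date_trunc("hour", to_timestamp(col("requestTime"), fmt))
    val responseHour = date_trunc("hour", to_timestamp(col("responseTime"), fmt))
    // unix_timestamp yields seconds since the epoch, so the difference is in seconds
    df.withColumn("diffSeconds", unix_timestamp(responseHour) - unix_timestamp(requestHour))
  }
}
```

Called as BuiltInHourDiff.withHourDiff(statusWithOutDuplication), this keeps all the work inside Spark's own expressions, which generally lets the optimizer handle it better than an opaque UDF.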

Answer 2

You can do it like this:

import org.apache.spark.sql.functions.{col, minute, second, unix_timestamp}

// Convert both columns to Unix time (seconds since the epoch).
// withColumn returns a new DataFrame, so keep the result.
val withUnixTimes = statusWithOutDuplication
    .withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss"))

// This resets minute and second to 0
def resetMinSec(colName: String) = {
    col(colName) - minute(col(colName).cast("timestamp")) * 60 - second(col(colName).cast("timestamp"))
}

// Create a new column with the difference between the Unix times, in seconds
withUnixTimes.select((resetMinSec("responseTime") - resetMinSec("requestTime")).as("diff"))

Note that I did not cast requestTime / responseTime to Timestamp here; you should do that cast after finding the difference.

The UDF approach should be similar, but it would use plain Scala methods to get the minutes/seconds from the timestamp.
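One possible sketch of that UDF logic in plain Scala, using java.time rather than anything Spark-specific. The object name HourTruncation and both helper names are hypothetical. Here truncatedTo(ChronoUnit.HOURS) stands in for Python's replace(second=0, minute=0), with the difference that it also clears fractional seconds.

```scala
import java.sql.Timestamp
import java.time.Duration
import java.time.temporal.ChronoUnit

object HourTruncation {
  // Rough counterpart of Python's dt.replace(minute=0, second=0);
  // truncatedTo additionally zeroes the sub-second part.
  def resetMinSec(ts: Timestamp): Timestamp =
    Timestamp.valueOf(ts.toLocalDateTime.truncatedTo(ChronoUnit.HOURS))

  // Whole hours between the two timestamps once both are truncated to the hour.
  // Assumes neither argument is null.
  def hourDiff(start: Timestamp, end: Timestamp): Long =
    Duration.between(
      resetMinSec(start).toLocalDateTime,
      resetMinSec(end).toLocalDateTime
    ).toHours
}
```

Since Spark maps its timestamp columns to java.sql.Timestamp in Scala UDFs, a function like hourDiff could presumably be wrapped as udf((s: Timestamp, e: Timestamp) => HourTruncation.hourDiff(s, e)); null inputs would need separate handling.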

Hope this helps a bit!

TimestampType difference and resetting the hour in Scala

2 answers: