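A simple conversion with unix_timestamp behaves differently in Spark 2.0.2 and in Spark 2.3.x.
At first I thought this might be related to the Spark environment, for example a time zone difference, but all settings are the same.
The example below shows the behavior.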
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.TimestampType
case class Dummy(mts: String, sts: String)
val testData = Seq(Dummy("2018-05-09-06.57.53.013768", "2018-05-09-06.57.53.013198"), Dummy("2018-11-21-04.30.03.804441", "2018-11-21-04.30.03.802212")).toDF
// time1 is parsed from sts and time2 from mts; the 2.0.2 output shows different values for the two columns
val result = testData
  .withColumn("time1", unix_timestamp(col("sts"), "yyyy-MM-dd-HH.mm.ss.SSSSSS").cast(TimestampType))
  .withColumn("time2", unix_timestamp(col("mts"), "yyyy-MM-dd-HH.mm.ss.SSSSSS").cast(TimestampType))
result.select($"time1", $"time2", $"sts", $"mts").show(false)
scala> spark.version
res25: String = 2.3.1.3.0.1.0-187
scala> result.select("time1", "time2", "sts", "mts").show(false)
+-----+-----+--------------------------+--------------------------+
|time1|time2|sts                       |mts                       |
+-----+-----+--------------------------+--------------------------+
|null |null |2018-05-09-06.57.53.013198|2018-05-09-06.57.53.013768|
|null |null |2018-11-21-04.30.03.802212|2018-11-21-04.30.03.804441|
+-----+-----+--------------------------+--------------------------+
scala> spark.version
res4: String = 2.0.2
scala> result.select("time1", "time2", "sts", "mts").show(false)
+---------------------+---------------------+--------------------------+--------------------------+
|time1                |time2                |sts                       |mts                       |
+---------------------+---------------------+--------------------------+--------------------------+
|2018-05-09 06:58:06.0|2018-05-09 06:58:06.0|2018-05-09-06.57.53.013198|2018-05-09-06.57.53.013768|
|2018-11-21 04:43:25.0|2018-11-21 04:43:27.0|2018-11-21-04.30.03.802212|2018-11-21-04.30.03.804441|
+---------------------+---------------------+--------------------------+--------------------------+
Is there a particular reason for this behavior?
Answer 0 (score: 1)
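The problem you are seeing comes from the unix_timestamp function.
It converts a string to a Unix timestamp in seconds, so anything finer than a second is dropped.
Spark 2.0.2 was very forgiving and still returned a value when the pattern contained SSSSSS. That value was not correct, though: in the 2.0.2 output in the question, 06.57.53 comes back as 06:58:06, so the fractional digits shifted the result instead of being ignored.
Somewhere between Spark 2.0.2 and 2.3.x the implementation changed, and you now get null to draw your attention to the problem.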
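The shift in the 2.0.2 output and the null in 2.3.x can both be reproduced outside Spark with java.text.SimpleDateFormat, where S means milliseconds. The sketch below assumes that Spark's unix_timestamp delegates to SimpleDateFormat in these versions, lenient in 2.0.2 and strict in 2.3.x. The thread does not confirm this, and the object and method names are made up for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical demo object; not part of Spark.
object FractionParsingDemo {
  // Parse `text` with `pattern` in UTC and return epoch seconds,
  // or a Failure if SimpleDateFormat rejects the input.
  def epochSeconds(pattern: String, text: String, lenient: Boolean): Try[Long] = Try {
    val fmt = new SimpleDateFormat(pattern, Locale.US)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.setLenient(lenient)
    fmt.parse(text).getTime / 1000L
  }

  def main(args: Array[String]): Unit = {
    val sts = "2018-05-09-06.57.53.013198"
    val secondsOnly = epochSeconds("yyyy-MM-dd-HH.mm.ss", sts, lenient = true)
    val lenientFull = epochSeconds("yyyy-MM-dd-HH.mm.ss.SSSSSS", sts, lenient = true)
    val strictFull  = epochSeconds("yyyy-MM-dd-HH.mm.ss.SSSSSS", sts, lenient = false)

    // Lenient: "013198" is read as 13198 milliseconds and rolled into the
    // seconds field, so 06:57:53 becomes 06:58:06.
    println(lenientFull.get - secondsOnly.get) // 13
    // Strict: 13198 is out of range for the millisecond field, so parsing fails.
    println(strictFull.isFailure)              // true
  }
}
```

The 13-second difference matches the first row of the 2.0.2 output (06:57:53 shown as 06:58:06), which supports the assumption but does not prove it.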
How to fix it? Just remove .SSSSSS from the pattern, which looks like this:
val result = testData
  .withColumn("time1", unix_timestamp(col("sts"), "yyyy-MM-dd-HH.mm.ss").cast(TimestampType))
  .withColumn("time2", unix_timestamp(col("sts"), "yyyy-MM-dd-HH.mm.ss").cast(TimestampType))
result.select("time1", "time2", "sts", "mts").show(false)
+-------------------+-------------------+--------------------------+--------------------------+
|time1              |time2              |sts                       |mts                       |
+-------------------+-------------------+--------------------------+--------------------------+
|2018-05-09 06:57:53|2018-05-09 06:57:53|2018-05-09-06.57.53.013198|2018-05-09-06.57.53.013768|
|2018-11-21 04:30:03|2018-11-21 04:30:03|2018-11-21-04.30.03.802212|2018-11-21-04.30.03.804441|
+-------------------+-------------------+--------------------------+--------------------------+
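The fix above drops the microseconds. If they matter, one option is to rewrite the string into the standard "yyyy-MM-dd HH:mm:ss.ffffff" layout and parse that instead. The plain-Scala sketch below shows the rewrite with java.sql.Timestamp; the object name and regex are illustrative only and not from the thread. In Spark the same rewrite could presumably be done with regexp_replace followed by a cast to TimestampType, but that variant is an untested assumption.

```scala
import java.sql.Timestamp

// Hypothetical helper; not part of Spark.
object KeepMicrosSketch {
  // Matches the layout used in the question: yyyy-MM-dd-HH.mm.ss.SSSSSS
  private val Layout = """(\d{4}-\d{2}-\d{2})-(\d{2})\.(\d{2})\.(\d{2})\.(\d{6})""".r

  // Rewrite to "yyyy-MM-dd HH:mm:ss.ffffff" and let java.sql.Timestamp parse it.
  // Returns None when the input does not match the expected layout.
  def parse(text: String): Option[Timestamp] = text match {
    case Layout(date, h, m, s, micros) => Some(Timestamp.valueOf(s"$date $h:$m:$s.$micros"))
    case _                             => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("2018-05-09-06.57.53.013198")) // Some(2018-05-09 06:57:53.013198)
    println(parse("not a timestamp"))            // None
  }
}
```

A helper like this could be wrapped in a UDF, at the cost of losing Spark's built-in optimizations for native functions.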