Question

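I've noticed some odd behavior in PySpark's (and possibly Spark's) to_timestamp function. It converts some strings to timestamps correctly, but returns null for other strings in exactly the same format. Consider the following example I came up with: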

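from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the active SparkSession (e.g. the one provided by the pyspark shell)
times = [['2030-03-10 02:56:07'], ['2030-03-11 02:56:07']]

df_test = spark.createDataFrame(times, schema=StructType([
    StructField("time_string", StringType())
]))
# Parse the strings with an explicit format
df_test = df_test.withColumn('timestamp',
                             F.to_timestamp('time_string',
                                            format='yyyy-MM-dd HH:mm:ss'))
df_test.show(2, False)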

This is what I get:

+-------------------+-------------------+
|time_string        |timestamp          |
+-------------------+-------------------+
|2030-03-10 02:56:07|null               |
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+

Why is the second string converted correctly but not the first? I also tried the unix_timestamp() function, and the result is the same.

Even stranger, if I omit the format argument I no longer get null, but the hour of the timestamp is incremented by one.

df_test2 = df_test.withColumn('timestamp', F.to_timestamp('time_string'))
df_test2.show(2, False)

Result:

+-------------------+-------------------+
|time_string        |timestamp          |
+-------------------+-------------------+
|2030-03-10 02:56:07|2030-03-10 03:56:07|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+

Any idea what is going on?
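One possible lead (an assumption, since the session time zone is not shown above): 2030-03-10 is the second Sunday of March, the date on which US daylight saving time begins under the rule in force since 2007, so in a US time zone the wall-clock hour 02:00 to 03:00 does not exist on that day. The sketch below uses only the Python standard library to check whether each test string falls in that skipped hour. The helper names `us_dst_start` and `in_spring_forward_gap` are hypothetical and are not part of Spark.

```python
from datetime import date, datetime, time, timedelta


def us_dst_start(year: int) -> date:
    """Second Sunday of March (assumed US DST start, rule in force since 2007)."""
    first = date(year, 3, 1)
    # weekday(): Monday=0 ... Sunday=6, so this is the offset to the first Sunday
    days_to_first_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_sunday + 7)


def in_spring_forward_gap(wall: datetime) -> bool:
    """True if a naive wall-clock time lies in the skipped 02:00-03:00 hour."""
    on_transition_day = wall.date() == us_dst_start(wall.year)
    return on_transition_day and time(2, 0) <= wall.time() < time(3, 0)


for s in ["2030-03-10 02:56:07", "2030-03-11 02:56:07"]:
    wall = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    print(s, in_spring_forward_gap(wall))
# 2030-03-10 02:56:07 True
# 2030-03-11 02:56:07 False
```

If the session really does run in such a time zone, this would be consistent with both symptoms: the first string names a local time that does not exist. Checking the value of the `spark.sql.session.timeZone` setting (and trying it with `UTC`) would be one way to confirm or rule this out.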

Update:

I also tried this in Scala via spark-shell, and the result is the same:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions

val times = Seq(Row("2030-03-10 02:56:07"), Row("2030-03-11 02:56:07"))
val schema=List((StructField("time_string", StringType)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(times), 
                               StructType(schema))
val df_test = df.withColumn("timestamp", 
                      functions.to_timestamp(functions.col("time_string"), 
                                             fmt="yyyy-MM-dd HH:mm:ss"))

df_test.show()

Result:

+-------------------+-------------------+
|        time_string|          timestamp|
+-------------------+-------------------+
|2030-03-10 02:56:07|               null|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+

Weird behavior of PySpark to_timestamp()
