How to convert a timestamp to bigint in a pyspark dataframe

Asked: 2019-09-15 14:42:17

Tags: dataframe pyspark apache-spark-sql

I'm using Python on a Spark environment and want to convert a dataframe column from the TIMESTAMP data type to bigint (UNIX timestamp). The column looks like this ("yyyy-MM-dd hh:mm:ss.SSSSSS"):

timestamp_col               
2014-06-04 10:09:13.334422      
2015-06-03 10:09:13.443322      
2015-08-03 10:09:13.232431

I have read around and tried the following:

from pyspark.sql.functions import from_unixtime, unix_timestamp
from pyspark.sql.types import TimestampType

df1 = df.select((from_unixtime(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss.SSSSSS"))).cast(TimestampType()).alias("unix_time_col"))

but the output contains only null values:

+-------------+
|unix_time_col|
+-------------+
|         null|
|         null|
|         null|

I'm using Python 3.7 with spark-2.3.1-bin-hadoop2.7 on google-colaboratory. I must be missing something. Please help?

2 Answers:

Answer 0 (score: 0)

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('abc').getOrCreate()

# Rebuild the question's sample data as a string column
column_schema = StructType([StructField("timestamp_col", StringType())])
data = [['2014-06-04 10:09:13.334422'], ['2015-06-03 10:09:13.443322'], ['2015-08-03 10:09:13.232431']]

data_frame = spark.createDataFrame(data, schema=column_schema)

# unix_timestamp with its default "yyyy-MM-dd HH:mm:ss" format parses the
# leading part of each string and returns epoch seconds as bigint;
# the trailing microseconds are ignored
data_frame = data_frame.withColumn('timestamp_col', unix_timestamp('timestamp_col'))
data_frame.show()
  

Output:

+-------------+
|timestamp_col|
+-------------+
|   1401894553|
|   1433344153|
|   1438614553|
+-------------+
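
Note that unix_timestamp returns whole seconds, so the microseconds in the sample data are dropped. A minimal alternative sketch (not part of the original answer; it assumes df is the question's dataframe and that timestamp_col is either a string in the format above or an actual timestamp column) is to cast straight through to long:

from pyspark.sql.functions import col

# Cast the string to a timestamp, then to long (epoch seconds, i.e. bigint).
# The fractional seconds are still truncated by the cast to long.
df1 = df.select(col("timestamp_col").cast("timestamp").cast("long").alias("unix_time_col"))
df1.show()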

Answer 1 (score: 0)

Remove ".SSSSSS" from the format string and the conversion to a unix timestamp will work, i.e. instead of "yyyy-MM-dd hh:mm:ss.SSSSSS", use the following:

df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss"))
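
For completeness, a small sketch of the full statement this answer suggests, applied to the question's dataframe (the column alias and show() call are added here only for illustration):

from pyspark.sql.functions import unix_timestamp

# Only the "yyyy-MM-dd hh:mm:ss" prefix is parsed; the trailing ".SSSSSS"
# microseconds are ignored and the result is a bigint of epoch seconds,
# interpreted in the session time zone
df1 = df.select(unix_timestamp(df.timestamp_col, "yyyy-MM-dd hh:mm:ss").alias("unix_time_col"))
df1.show()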