How can I define a date format for a column in a PySpark SQL schema so that it is parsed as a timestamp?
At the moment I can only do this in an extra DataFrame transformation step on a StringType field, which is cumbersome.
Thanks!
Here is the indirect approach I am currently using:
Given input .csv data such as:
1,Foo,19/06/2017 14:41:20
2,Bar,19/06/2018 15:41:45
I do:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import to_timestamp

field_integer = StructField('my_integer', IntegerType())
field_string = StructField('my_string', StringType())
# Want some date format here but need to treat the date field as StringType.
field_date = StructField('my_date', StringType())
my_schema = StructType([field_integer, field_string, field_date])
df_test = spark.read.schema(my_schema).csv('my_data.csv')
# Extra step: parse the string column into a timestamp after reading.
df_new = df_test.select(['my_integer', 'my_string',
                         to_timestamp(df_test['my_date'], 'dd/MM/yyyy HH:mm:ss').alias('parsed_date')])
This works, but it is not very direct.
df_new.show()
+----------+---------+-------------------+
|my_integer|my_string|        parsed_date|
+----------+---------+-------------------+
|         1|      Foo|2017-06-19 14:41:20|
|         2|      Bar|2018-06-19 15:41:45|
+----------+---------+-------------------+
df_new.printSchema()
root
 |-- my_integer: integer (nullable = true)
 |-- my_string: string (nullable = true)
 |-- parsed_date: timestamp (nullable = true)
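For the record, what I am hoping for is something closer to declaring the format at read time, so the extra select() step disappears. Here is a minimal sketch of what I imagine, assuming the csv reader's timestampFormat option applies to TimestampType columns in the schema (I have not confirmed this is the idiomatic route):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

# Declare the column as TimestampType directly in the schema...
my_schema = StructType([
    StructField('my_integer', IntegerType()),
    StructField('my_string', StringType()),
    StructField('my_date', TimestampType()),
])

# ...and tell the reader how to parse it while loading the file,
# so no follow-up transformation is needed.
df_test = spark.read.csv('my_data.csv', schema=my_schema,
                         timestampFormat='dd/MM/yyyy HH:mm:ss')

Is something along these lines supported, or is the to_timestamp() detour the expected pattern?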