Question

I have a PySpark DataFrame with two columns. I then used the withColumn function to add a third column that holds the current date for every existing row.

df.printSchema()
Name --- string
City ----string

from pyspark.sql.functions import current_date
df = df.withColumn("created_date", current_date())

df.printSchema()
Name --- string
City --- string
created_date --- Date

df.show(2)
Name   City   created_date
Greg   MN     2020-09-13
John   NY     2020-09-13

After that, I saved the file to an S3 bucket with the following command:

df.write.format("csv").option("header", "true").option("delimiter", ",").save("s3://location")

Later, when I read the CSV file back from S3 with PySpark, the data type of the created_date column had changed to timestamp.

df1 = spark.read.format("csv").option("header","true").option("delimiter",",").option("inferschema","true").load("s3://location/xxxx.csv")

df1.printSchema()
Name --- string
City --- string
created_date --- Timestamp

 df1.show(2)
 Name   City   created_date
 Greg   MN     2020-09-13 00:00:00
 John   NY     2020-09-13 00:00:00

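Does anyone know why the created_date column's data type changes to timestamp instead of date when the file is read from S3? I actually need the date data type on read. Thanks for your help!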

Answer 1

This behavior has nothing to do with S3; it comes from how Spark infers data types on read.

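Schema inference can lead to unexpected behavior in non-trivial cases. In your case, the date field is interpreted as Timestamp; the hour, minute, and second are all set to zero because the data contains no values for them.

Try setting the schema explicitly on read:

from pyspark.sql.types import StructType, StructField, StringType, DateType

# Declare every column type up front so that Spark skips inference.
customSchema = StructType([
    StructField("Name", StringType()),
    StructField("City", StringType()),
    StructField("created_date", DateType())
])

# The file was written with "," as the delimiter, so read it with the same one.
df1 = (spark.read.format("csv")
    .option("delimiter", ",")
    .option("header", "true")
    .schema(customSchema)
    .load("s3://location/xxxx.csv"))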

Column data type changes when reading a CSV file from an Amazon S3 bucket
