I'm using PySpark on AWS Glue. When I write a dataset that uses a date column as a partition key, is it always converted to a string?
df = df \
.withColumn("querydatetime", to_date(df["querydatetime"], DATE_FORMAT_STR))
...
df \
.repartition("querydestinationplace", "querydatetime") \
.write \
.mode("overwrite") \
.partitionBy(["querydestinationplace", "querydatetime"]) \
.parquet("s3://xxx/flights-test")
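For context, the write above produces Hive-style partition directories, where each partition value is serialized into the directory name as text regardless of the column's original type. A sketch of the layout (the concrete values here are hypothetical):

```python
# Hive-style partition layout: values live in directory names, so they are
# always serialized as text. The values 123 / "2019-01-01" are made up for
# illustration.
base = "s3://xxx/flights-test"
querydestinationplace = 123      # int in the DataFrame
querydatetime = "2019-01-01"     # date in the DataFrame

partition_dir = (
    f"{base}/querydestinationplace={querydestinationplace}"
    f"/querydatetime={querydatetime}/"
)
print(partition_dir)
# s3://xxx/flights-test/querydestinationplace=123/querydatetime=2019-01-01/
```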
I noticed in my Athena table DDL:
CREATE EXTERNAL TABLE `flights_test`(
`key` string,
`agent` int,
`queryoutbounddate` date,
`queryinbounddate` date,
`price` decimal(10,2),
`outdeparture` timestamp,
`indeparture` timestamp,
`numberoutstops` int,
`out_is_holiday` boolean,
`out_is_longweekends` boolean,
`in_is_holiday` boolean,
`in_is_longweekends` boolean)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://xxx/flights-test/'
TBLPROPERTIES (...)
Notice:
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
Do partition columns always have to be strings? querydestinationplace should actually be an int. Is the string type less efficient than int or date here?
Answer (score: 2)
This is a known behavior of Parquet partitioning. You can add the following line before reading the Parquet files to disable it:
# Prevent the integer id fields used for partitioning from being
# type-inferred (i.e. cast back to integers) when the data is read.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")