PySpark / Glue: when using a date column as a partition key, is it always cast to String?

Date: 2019-05-14 13:36:32

Tags: apache-spark pyspark parquet amazon-athena aws-glue

I'm using PySpark on AWS Glue. When writing a dataset with a date column as the partition key, is that column always converted to a string?

df = df \
  .withColumn("querydatetime", to_date(df["querydatetime"], DATE_FORMAT_STR))
...
df \
  .repartition("querydestinationplace", "querydatetime") \
  .write \
  .mode("overwrite") \
  .partitionBy(["querydestinationplace", "querydatetime"]) \
  .parquet("s3://xxx/flights-test")
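With Hive-style partitioning, which the write above produces, the partition-column values are not stored inside the Parquet files at all; they are encoded as strings in the directory names. A minimal sketch of that path encoding (`hive_partition_path` is a hypothetical helper for illustration, not a Spark API):

```python
import datetime

def hive_partition_path(base, **partitions):
    """Build a Hive-style partition directory path; every value becomes a string."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([base] + parts)

path = hive_partition_path(
    "s3://xxx/flights-test",
    querydestinationplace=123,                    # an int in the DataFrame
    querydatetime=datetime.date(2019, 5, 14),     # a date in the DataFrame
)
print(path)
# s3://xxx/flights-test/querydestinationplace=123/querydatetime=2019-05-14
```

Whatever type the columns had in the DataFrame, the path only carries their string representations; recovering a typed column is then a question of how the reader interprets those strings.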

I noticed the DDL of my Athena table:

CREATE EXTERNAL TABLE `flights_test`(
  `key` string, 
  `agent` int, 
  `queryoutbounddate` date, 
  `queryinbounddate` date, 
  `price` decimal(10,2), 
  `outdeparture` timestamp, 
  `indeparture` timestamp, 
  `numberoutstops` int, 
  `out_is_holiday` boolean, 
  `out_is_longweekends` boolean, 
  `in_is_holiday` boolean, 
  `in_is_longweekends` boolean)
PARTITIONED BY ( 
  `querydestinationplace` string, 
  `querydatetime` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xxx/flights-test/'
TBLPROPERTIES (...)

Notice the partition clause:

PARTITIONED BY ( 
  `querydestinationplace` string, 
  `querydatetime` string)

Must partition columns always be strings? `querydestinationplace` really should be an int. Is a string partition column less efficient than an int or date one?

1 Answer:

Answer 0 (score: 2)

This is known Parquet behavior. You can add the following line before reading the Parquet files to ignore it:

# Disable partition-column type inference so that the id fields used for
# partitioning are not cast back to integers on read; they stay as strings.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
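Conceptually, when that setting is enabled (the default), Spark reads each partition value out of the directory path as a string and then tries to infer a narrower type for it; when it is disabled, every partition column stays a string. A simplified illustration of that idea (`infer_partition_value` is my own sketch, not Spark's actual implementation, which also handles long, double, decimal, and timestamp candidates):

```python
import datetime

def infer_partition_value(raw, inference_enabled=True):
    """Mimic (loosely) how a partition value taken from a path might be typed."""
    if not inference_enabled:
        return raw                                    # always left as a string
    try:
        return int(raw)                               # e.g. querydestinationplace=123
    except ValueError:
        pass
    try:
        return datetime.date.fromisoformat(raw)       # e.g. querydatetime=2019-05-14
    except ValueError:
        return raw                                    # fall back to string

print(infer_partition_value("123"))                           # 123 (int)
print(infer_partition_value("2019-05-14"))                    # 2019-05-14 (date)
print(infer_partition_value("123", inference_enabled=False))  # 123 (still a string)
```

Alternatively, if you want typed partition columns after reading, you can cast them back explicitly, e.g. `df.withColumn("querydatetime", df["querydatetime"].cast("date"))`.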