I am trying to optimize my Glue / PySpark job by using push-down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it seems that Glue still attempts to read data outside the specified date range?
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
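Note that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior". Is something wrong?
I wonder whether the problem is that the push-down predicate does not support BETWEEN or IN?
Table creation DDL: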
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
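Answer 0 (score: 3)
One issue I can see in the code is that you are using "today" instead of "end" in the BETWEEN clause. I don't see the today variable declared anywhere in your code, but I assume it has been initialized with today's date.
In that case the range is different, and the partitions being read by Glue Spark are the correct ones for that range.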
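To make the difference concrete, here is a minimal sketch of how the predicate string changes with the upper bound. It runs locally and does not call Glue. The value of `today` is an assumption (the question never defines it), and `build_predicate` is a hypothetical helper used only for illustration.

```python
from datetime import date

def build_predicate(lower, upper):
    """Format a date-range predicate on the querydatetime partition column."""
    return "querydatetime BETWEEN '%s' AND '%s'" % (
        lower.strftime("%Y-%m-%d"), upper.strftime("%Y-%m-%d"))

start = date(2019, 2, 13)
end = date(2019, 2, 27)
today = date(2019, 3, 15)  # assumption: some later date, not given in the question

intended = build_predicate(start, end)   # the range the question describes
actual = build_predicate(start, today)   # the range the question's code builds

print(intended)
print(actual)

# With `today` as the upper bound, a partition such as 2019-03-10 falls inside the range.
print("2019-02-13" <= "2019-03-10" <= today.strftime("%Y-%m-%d"))
```

If `today` really is later than 2019-03-10, reading the querydatetime=2019-03-10 partition is what the predicate asks for.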
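Answer 1 (score: 3)
To push down your condition, you need to change the order of the columns in the partition by clause of the table definition.
A condition with an "in" predicate on the first partition column cannot be pushed down the way you expect.
Let me know if this helps.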
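Answer 2 (score: 1)
Push-down predicates can be used with both BETWEEN and IN clauses,
as long as you have the correct partition column order defined in the table definition and in the query.
I have a table with three levels of partitions.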
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Reading the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
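The result above can be checked by hand. The following is a local sketch only: it evaluates the same conditions against the three partition values listed earlier, in plain Python, and says nothing about how Glue implements pruning internally. The `matches` function is a hypothetical stand-in for the predicate.

```python
# Partition values and record counts taken from the listing above.
partitions = [
    {"year": "2018", "month": "01", "day": "01", "records": 50},
    {"year": "2018", "month": "02", "day": "02", "records": 40},
    {"year": "2018", "month": "03", "day": "03", "records": 30},
]

def matches(p):
    # Mirrors: year == '2018' and month between '02' and '03' and day in ('03')
    # Partition values are strings, so the comparisons are string comparisons.
    return (p["year"] == "2018"
            and "02" <= p["month"] <= "03"
            and p["day"] in ("03",))

selected = [p for p in partitions if matches(p)]
total = sum(p["records"] for p in selected)
print(total)  # 30
```

Only the month=03/day=03 partition satisfies all three conditions, which agrees with the count of 30.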
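So if the column order is specified correctly, you get the correct result. Also note that you need to put single quotes (') around the values in the IN clause: IN ('%s').
Partition columns in the table: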
querydestinationplace string,
querydatetime string
Reading the data into the DynamicFrame:
# Join with "','" so every value gets its own quotes: IN ('a','b','c').
# The upper bound is `end`; `today` is not defined in the question's code.
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( "','".join(map(str, arr)),
start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")))
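One detail is worth checking when quoting the IN list: joining the values with a bare comma inside IN ('%s') produces a single string literal, not a list of values. The sketch below compares the two strings locally. `quoted_in_list` is a hypothetical helper, the IDs in `arr` are made up for illustration, and the helper assumes the values contain no quote characters.

```python
def quoted_in_list(values):
    """Return the values as comma-separated, single-quoted string literals.

    Assumes the values themselves contain no single quotes.
    """
    return ", ".join("'%s'" % v for v in values)

arr = [12191, 13554]  # hypothetical destination IDs

# A bare comma join inside one pair of quotes yields ONE literal.
single_literal = "querydestinationplace IN ('%s')" % ",".join(map(str, arr))

# Quoting each value yields a list of literals.
literal_list = "querydestinationplace IN (%s)" % quoted_in_list(arr)

print(single_literal)
print(literal_list)
```

Because the partition column is declared as string in the DDL, the second form compares it against each ID separately.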
Answer 3 (score: 0)
Try doing it like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
pd_predicate is a string that will be used as the push_down_predicate.
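Note that the two variants of the predicate are not equivalent: `>= start and < end` leaves out the end date, while BETWEEN includes both bounds. The sketch below shows the difference locally, with no Glue involved. It relies on the fact that dates formatted as YYYY-MM-DD sort the same way as strings and as dates. The list of partition values is hypothetical.

```python
from datetime import date, timedelta

start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))

# Hypothetical partition values: one per day from 2019-02-10 to 2019-03-02.
days = [str(date(2019, 2, 10) + timedelta(days=i)) for i in range(21)]

# querydatetime >= start and querydatetime < end
half_open = [d for d in days if start <= d < end]

# querydatetime between start and end
inclusive = [d for d in days if start <= d <= end]

print(half_open[-1], len(half_open))
print(inclusive[-1], len(inclusive))
```

If the partition for 2019-02-27 should be read, use the BETWEEN form or `<=` for the upper bound.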
If you are interested, here is a good read on the subject:
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/