AWS push down predicate not working when reading HIVE partitions

Posted: 2019-09-13 13:58:03

Tags: amazon-web-services aws-glue

I'm trying to test some Glue functionality, and the push down predicate is not working on Avro files in S3 that were partitioned for use in Hive. Our partitions look like: YYYY-MM-DD.
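For context (an illustration, not from the post): a Hive-style partition directory encodes the column name in the path (e.g. `loaddate=2019-08-08`), while the layout described above contains only the value. A quick, hypothetical way to tell the two apart:

```python
import re

# Illustration only: a Hive-style partition directory looks like
# "column=value"; the layout in the question ("2019-08-08/") has no key,
# so the Glue crawler cannot recover the column name from the path.
def is_hive_style(partition_dir):
    return re.fullmatch(r"[^=/]+=[^/]+", partition_dir) is not None

print(is_hive_style("loaddate=2019-08-08"))  # True
print(is_hive_style("2019-08-08"))           # False
```

This distinction matters later: when the path carries no column name, the catalog falls back to a generated one.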

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

filterpred = "loaddate == '2019-08-08'"

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "hive", 
                                                            table_name = "stuff", 
                                                            pushDownPredicate = filterpred)
print('############################################')
print("COUNT: ", datasource0.count())
print('##############################################')

df = datasource0.toDF()
df.show(5)

job.commit()

But I still see Glue pulling in files outside of the date range:

Opening 's3://data/2018-11-29/part-00000-a58ee9cb-c82c-46e6-9657-85b4ead2927d-c000.avro' for reading
2019-09-13 13:47:47,071 INFO [Executor task launch worker for task 258] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1208)) -
Opening 's3://data/2017-09-28/part-00000-53c07db9-05d7-4032-aa73-01e239f509cf.avro' for reading

I have tried using the following examples:

AWS Glue DynamicFrames and Push Down Predicate

AWS Glue pushdown predicate not working properly

None of the proposed solutions have worked for me so far. I have tried adding the partition column (loaddate), taking it out, quoting it, unquoting it, etc. It still pulls in data outside of the date range.

2 Answers:

Answer 0 (score: 0)

There is an error in your code: the correct parameter to pass to the from_catalog function is push_down_predicate, not pushDownPredicate.

Sample code snippet:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
             database = "hive", 
             table_name = "stuff",
             push_down_predicate = filterpred)

Reference from the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

Answer 1 (score: 0)

It seems your partitions are not in Hive naming style, so you have to use the default generated column name, partition_0, in your query. Also, as the other answer suggests, the parameter is called push_down_predicate:

filterpred = "partition_0 == '2019-08-08'"

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hive",
    table_name = "stuff",
    push_down_predicate = filterpred)
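Building on that answer, a small helper like the following (hypothetical, not from either answer) could assemble a push down predicate string covering an inclusive date range, using the catalog's auto-generated partition_0 column name assumed above:

```python
from datetime import date, timedelta

# Hypothetical helper: build a predicate string such as
# "partition_0 in ('2019-08-08', '2019-08-09')" for an inclusive
# date range, suitable for passing as push_down_predicate.
def date_range_predicate(start, end, column="partition_0"):
    days = (end - start).days + 1
    values = ", ".join(
        "'{}'".format((start + timedelta(d)).isoformat()) for d in range(days)
    )
    return "{} in ({})".format(column, values)

print(date_range_predicate(date(2019, 8, 8), date(2019, 8, 10)))
# partition_0 in ('2019-08-08', '2019-08-09', '2019-08-10')
```

The resulting string would then be passed as the push_down_predicate argument in the snippet above.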