I am reading data from a Redshift cluster using spark-redshift. Code sample:
spark.read
.format("com.databricks.spark.redshift")
.option("url", s"$redshiftUrl/$redshiftDatabase")
.option("user", redshiftUsername)
.option("password", redshiftPassword)
.option("tempdir", s"s3a://$redshiftTempBucket")
.option("driver", redshiftDriver)
.option("tempformat", "CSV GZIP")
.option("aws_iam_role", redshiftAwsIamRole)
.option("dbtable", s"${table.schema}.${table.name}")
.load()
.filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23))
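One thing worth noting in the plan below: the column is wrapped in a cast (`cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23`), and Catalyst only translates comparisons on a bare attribute into source filters, so a cast around the column is enough to keep `EqualTo` out of `PushedFilters`. As a purely hypothetical diagnostic (the column's actual type is an assumption here), one could try matching the literal's type to the column's type and re-checking the plan:

```scala
// Sketch, not a confirmed fix: make the literal's type match the column's
// declared type so Catalyst does not insert a cast around the attribute.
// If SOME_COLUMN_TO_FILTER_BY is a bigint column (assumption):
val filtered = df.filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23L))

// Then inspect whether EqualTo now appears in PushedFilters:
filtered.explain(true)
```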
The extended physical plan is as follows:
== Physical Plan ==
InMemoryTableScan [columns..., ... 45 more fields]
+- InMemoryRelation [columns..., ... 45 more fields], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Filter (cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23)
+- *(1) Scan RedshiftRelation(${table.schema}.${table.name}) [columns...,... 45 more fields] PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)], ReadSchema: struct<...>
So, as you can see in PushedFilters, only the IsNotNull check is pushed down, and not the equality with 23: PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)]
As a result, the query Redshift uses to unload the data to S3 is:
SELECT columns... FROM "schema"."table" WHERE "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL
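For comparison, if the equality were also translated into a source filter, the unloading query would presumably look something like this (a sketch of the expected shape, not an observed query):

```sql
SELECT columns... FROM "schema"."table"
WHERE "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL
  AND "SOME_COLUMN_TO_FILTER_BY" = 23
```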
My question: why is the equality with 23 not pushed down into the query Redshift uses to unload the data to S3?