I am reading data from a Redshift cluster using spark-redshift. Code sample:
spark.read
.format("com.databricks.spark.redshift")
.option("url", s"$redshiftUrl/$redshiftDatabase")
.option("user", redshiftUsername)
.option("password", redshiftPassword)
.option("tempdir", s"s3a://$redshiftTempBucket")
.option("driver", redshiftDriver)
.option("tempformat", "CSV GZIP")
.option("aws_iam_role", redshiftAwsIamRole)
.option("dbtable", s"${table.schema}.${table.name}")
.load()
.filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23))
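One thing worth noting in the plan below: the column is wrapped in a cast (`cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23`), and Catalyst only translates comparisons on a bare attribute into source filters, so a cast around the column is enough to keep `EqualTo` out of `PushedFilters`. As a purely hypothetical diagnostic (the column's actual type is an assumption here), one could try matching the literal's type to the column's type and re-checking the plan:

```scala
// Sketch, not a confirmed fix: make the literal's type match the column's
// declared type so Catalyst does not insert a cast around the attribute.
// If SOME_COLUMN_TO_FILTER_BY is a bigint column (assumption):
val filtered = df.filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23L))

// Then inspect whether EqualTo now appears in PushedFilters:
filtered.explain(true)
```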
The extended physical plan is as follows:
== Physical Plan ==
InMemoryTableScan [columns..., ... 45 more fields]
+- InMemoryRelation [columns..., ... 45 more fields], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Filter (cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23)
+- *(1) Scan RedshiftRelation(${table.schema}.${table.name}) [columns...,... 45 more fields] PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)], ReadSchema: struct<...>
So, as you can see in PushedFilters, only the IsNotNull check is pushed down, and not the equality with 23: PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)]
As a result, the query Redshift uses to unload the data to S3 is:
SELECT columns... FROM "schema"."table" WHERE "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL
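For comparison, if the equality were also translated into a source filter, the unloading query would presumably look something like this (a sketch of the expected shape, not an observed query):

```sql
SELECT columns... FROM "schema"."table"
WHERE "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL
  AND "SOME_COLUMN_TO_FILTER_BY" = 23
```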
My question: why is the equality with 23 not pushed down into the query Redshift uses to unload the data to S3?