Spark Redshift pushdown filter

Time: 2020-03-12 09:42:57

Tags: scala apache-spark pyspark apache-spark-sql amazon-redshift

I am using spark-redshift to read data from a Redshift cluster. Code example:

    import org.apache.spark.sql.functions.{col, lit}

    spark.read
      .format("com.databricks.spark.redshift")
      .option("url", s"$redshiftUrl/$redshiftDatabase")
      .option("user", redshiftUsername)
      .option("password", redshiftPassword)
      .option("tempdir", s"s3a://$redshiftTempBucket")
      .option("driver", redshiftDriver)
      .option("tempformat", "CSV GZIP")
      .option("aws_iam_role", redshiftAwsIamRole)
      .option("dbtable", s"${table.schema}.${table.name}")
      .load()
      .filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23))
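
For reference, the plan below can be printed with Dataset.explain on the resulting DataFrame; a minimal sketch, assuming the read above is bound to a hypothetical val named df (a name not in the original):

    // Hypothetical binding of the read above; explain(true) prints the parsed,
    // analyzed, optimized and physical plans, including PushedFilters.
    val df = spark.read
      .format("com.databricks.spark.redshift")
      // ... same options as above ...
      .load()
      .filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23))

    df.explain(true)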

The extended physical plan is as follows:

    == Physical Plan ==
    InMemoryTableScan [columns..., ... 45 more fields]
       +- InMemoryRelation [columns..., ... 45 more fields], StorageLevel(disk, memory, deserialized, 1 replicas)
             +- *(1) Filter (cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23)
                +- *(1) Scan RedshiftRelation(${table.schema}.${table.name}) [columns...,... 45 more fields] PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)], ReadSchema: struct<...>

So, as you can see in PushedFilters, only the IsNotNull check is pushed down to the source, not the equality with 23: PushedFilters: [*IsNotNull(SOME_COLUMN_TO_FILTER_BY)]
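
Note that in the Filter node above the column is wrapped in a cast (cast(SOME_COLUMN_TO_FILTER_BY#7022 as bigint) = 23). Spark's data source pushdown only translates predicates over bare column references, so a cast inserted by type coercion is the usual reason an equality stays in Spark. A minimal sketch of one way to test this hypothesis, reusing the hypothetical df binding from the sketch above:

    // Build the literal with the column's own type so analysis has no reason
    // to wrap the column in a cast, then inspect PushedFilters again.
    val colType  = df.schema("SOME_COLUMN_TO_FILTER_BY").dataType
    val pushable = df.filter(col("SOME_COLUMN_TO_FILTER_BY") === lit(23).cast(colType))
    pushable.explain(true) // check whether an EqualTo filter now appears in PushedFilters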

As a result of only IsNotNull being pushed, the query Redshift runs to unload the data to S3 is:

    SELECT columns... FROM "schema"."table" WHERE "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL
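
For context on how that WHERE clause comes about: source connectors compile each pushed org.apache.spark.sql.sources.Filter into a SQL predicate and join the predicates with AND; a filter that cannot be compiled stays in Spark. A simplified illustration of that translation, not the connector's actual code:

    import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

    // Translate a pushed source filter into a SQL predicate, if possible.
    def toSql(f: Filter): Option[String] = f match {
      case IsNotNull(attr)  => Some(s""""$attr" IS NOT NULL""")
      case EqualTo(attr, v) => Some(s""""$attr" = $v""")
      case _                => None // untranslatable filters are evaluated by Spark
    }

    // Only IsNotNull reached the source here, hence the observed WHERE clause.
    val where = Seq[Filter](IsNotNull("SOME_COLUMN_TO_FILTER_BY"))
      .flatMap(toSql)
      .mkString(" AND ")
    // where: "SOME_COLUMN_TO_FILTER_BY" IS NOT NULL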

My question: why is the equality with 23 not pushed down into the query Redshift uses to unload the data to S3?

0 Answers:

No answers yet.