Question

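I'm using Spark SQL to query data in Hive. The data is partitioned, and Spark SQL prunes the partitions correctly when querying.

However, to determine, for a given query, which part of the data the computation will run on, I need to list the source tables together with either the partition filters or the specific input files (.inputFiles would be the obvious choice, but it does not reflect pruning).

The closest I have come is calling df.queryExecution.executedPlan.collectLeaves(). This contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] to the org.apache.spark.sql.hive package. I believe the relevant fields are relation and partitionPruningPred.

Is there any way to achieve this?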

Update: Thanks to Jacek's suggestion, and by calling getHiveQlPartitions on the returned relation with partitionPruningPred as the argument, I was able to get the relevant information:

scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))

This contains all the data I need, including the paths to all input files, with partition pruning correctly applied.
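As a rough sketch of how the one-liner above could be turned into a list of paths: the object name `prunedInputs` and its `list` method are hypothetical, and the code assumes a Spark version where `relation` exposes `getHiveQlPartitions` (as in the snippet above) and returns Hive's `org.apache.hadoop.hive.ql.metadata.Partition` objects. It has to be compiled into the org.apache.spark.sql.hive package because HiveTableScanExec is private[hive].

```scala
package org.apache.spark.sql.hive

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.execution.HiveTableScanExec

// Hypothetical helper, not part of Spark.
object prunedInputs {
  // Returns one (db.table, partition name, data location) triple per partition
  // that survives pruning. Assumes relation.getHiveQlPartitions is available.
  def list(execPlan: SparkPlan): Seq[(String, String, String)] =
    execPlan.collect { case s: HiveTableScanExec => s }.flatMap { s =>
      s.relation.getHiveQlPartitions(s.partitionPruningPred).map { p =>
        val table = s"${p.getTable.getDbName}.${p.getTable.getTableName}"
        (table, p.getName, p.getDataLocation.toString)
      }
    }
}
```

Calling `prunedInputs.list(df.queryExecution.executedPlan).foreach(println)` would then print one line per pruned partition.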

Answer 1

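Well, you're asking for low-level details of query execution, and things are bumpy down there. You've been warned :)

As you noted in your comment, all the execution information is in this private[hive] HiveTableScanExec.

One way to get some insight into the HiveTableScanExec physical operator (that is, a Hive table at execution time) is to create a sort of backdoor in the org.apache.spark.sql.hive package that is not private[hive].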

package org.apache.spark.sql.hive

import org.apache.spark.sql.hive.execution.HiveTableScanExec
object scan {
  def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) = execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables }
}

Change the code to meet your needs.
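For instance, a variation along these lines might expose the two fields mentioned in the question through public types only. This is a sketch: the object name `scanInfo` and the method `tablesWithPredicates` are made up for illustration, and it assumes `relation` carries a `tableMeta` catalog entry, as HiveTableRelation does in the Spark 2.4 build used in this answer.

```scala
package org.apache.spark.sql.hive

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.execution.HiveTableScanExec

// Hypothetical companion to the scan object above, not part of Spark.
object scanInfo {
  // One entry per Hive table scan in the plan: the table identifier and the
  // partition pruning predicates attached to that scan.
  // Assumes relation.tableMeta exists (HiveTableRelation).
  def tablesWithPredicates(execPlan: SparkPlan): Seq[(TableIdentifier, Seq[Expression])] =
    execPlan.collect {
      case s: HiveTableScanExec => (s.relation.tableMeta.identifier, s.partitionPruningPred)
    }
}
```

Because the return type uses only public classes, the result can be inspected from ordinary user code, for example with `scanInfo.tablesWithPredicates(execPlan).foreach(println)`.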

With scan.findHiveTables, I usually use :paste -raw while in spark-shell to sneak into such "uncharted areas".

You could then simply do the following:

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

// Create a Hive table
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
  tableName = "h1",
  source = "hive", // <-- that makes for a Hive table
  schema = new StructType().add($"id".long),
  options = Map.empty[String, String])

// select * from h1
val q = spark.table("h1")
val execPlan = q.queryExecution.executedPlan
scala> println(execPlan.numberedTreeString)
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L]

// Use the above code and :paste -raw in spark-shell

import org.apache.spark.sql.hive.scan
scala> scan.findHiveTables(execPlan).size
res11: Int = 1

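The relation field is the Hive table after it has been resolved by the ResolveRelations and FindDataSourceTable logical rules, which the Spark analyzer uses to resolve data sources and Hive tables.

You can get pretty much all the information Spark uses from the Hive metastore through the ExternalCatalog interface, which is available as spark.sharedState.externalCatalog. That gives you nearly all the metadata Spark uses to plan queries over Hive tables.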
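A minimal sketch of that lookup, meant for spark-shell, follows. It assumes the `h1` table from the example above lives in the `default` database and that `sharedState` is publicly accessible, as it is in the 2.4 build shown earlier. Note that this lists all partitions of a table and does not apply any query-specific pruning.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val catalog = spark.sharedState.externalCatalog

// "default" and "h1" are the database and table from the example above.
val table = catalog.getTable("default", "h1")
println(table.identifier)
println(table.storage.locationUri)
println(table.partitionColumnNames)

// Only partitioned tables have partitions to list; h1 above has none.
if (table.partitionColumnNames.nonEmpty) {
  catalog.listPartitions("default", "h1").foreach { p =>
    // spec maps partition column names to values; the location is optional.
    println(s"${p.spec} -> ${p.storage.locationUri.getOrElse("<no location>")}")
  }
}
```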

How to list partition-pruned inputs for a Hive table?
