I have a Hive table that is partitioned by a date column. I want to write a query that fetches data only from the latest (maximum) partition.
spark.sql("select field from table where date_of = '2019-06-23'").explain(True)
vs
spark.sql("select filed from table where date_of = (select max(date_of) from table)").explain(True)
Below are the physical plans of the two queries (literal-date filter first, subquery filter second):
*(1) Project [qbo_company_id#120L]
+- *(1) FileScan parquet
table[qbo_company_id#120L,date_of#157] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[s3location..., PartitionCount: 1, PartitionFilters: [isnotnull(date_of#157), (cast(date_of#157 as string) = 2019-06-23)], PushedFilters: [], ReadSchema: struct<qbo_company_id:bigint>
*(1) Project [qbo_company_id#1L]
+- *(1) Filter (date_of#38 = Subquery subquery0)
: +- Subquery subquery0
: +- *(2) HashAggregate(keys=[], functions=[max(date_of#76)], output=[max(date_of)#78])
: +- Exchange SinglePartition
: +- *(1) HashAggregate(keys=[], functions=[partial_max(date_of#76)], output=[max#119])
: +- LocalTableScan [date_of#76]
+- *(1) FileScan parquet
table[qbo_company_id#1L,date_of#38] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[s3location..., PartitionCount: 1836, PartitionFilters: [isnotnull(date_of#38)], PushedFilters: [], ReadSchema: struct<qbo_company_id:bigint>
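The key difference is visible in the FileScan nodes: the literal-date filter prunes to PartitionCount: 1, while the subquery version keeps PartitionCount: 1836 because the subquery result is not known at planning time. A minimal two-step workaround would be to resolve the max value on the driver first and substitute it as a literal (a sketch using the table and column names from the question; the answers below take a metastore-based route instead):

// Sketch: compute the max partition value first, then filter with a literal
// so PartitionFilters can prune the scan to a single partition.
val maxDate = spark.sql("select max(date_of) from table").collect().head.get(0).toString
spark.sql(s"select field from table where date_of = '$maxDate'").explain(true)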
Answer 0 (score: 2)
If I were you... I would use a different approach rather than a SQL query and a full table scan:
spark.sql(s"show partitions $tablename")
Then I would convert the result into a Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]], with each partition value parsed as a Joda date:
import org.apache.spark.sql.SparkSession
import org.joda.time.DateTime

/**
 * listMyHivePartitions - lists the Hive partitions of a table as a sequence of maps
 * (one map per partition: partition key -> partition value as a DateTime).
 *
 * @param tableName fully qualified table name
 * @param spark     active SparkSession
 * @return Seq[Map[String, DateTime]]
 */
def listMyHivePartitions(tableName: String, spark: SparkSession): Seq[Map[String, DateTime]] = {
  println(s"Listing the partition keys of $tableName")
  // Each row of "show partitions" looks like "date_of=2019-06-23" (or "k1=v1/k2=v2" for multi-level partitions)
  val partitions: Seq[String] = spark.sql(s"show partitions $tableName").collect().map { row =>
    println(s" Identified Key: ${row.toString()}")
    row.getString(0)
  }.toSeq
  println(s"Fetched ${partitions.size} partitions from $tableName")

  // Split every partition spec into key/value pairs and parse the value as a Joda DateTime
  partitions.map(key => key.split("/").toSeq.map { keyVal =>
    val keyValSplit = keyVal.split("=")
    (keyValSplit(0).toLowerCase().trim, new DateTime(keyValSplit(1).trim))
  }.toMap)
}
and then apply getRecentPartitionDate on that sequence, as follows:
/**
 * getRecentPartitionDate - returns the partition map whose value for `column` is the most recent date.
 *
 * @param column   name of the partition column (e.g. "date_of")
 * @param seqOfMap sequence of partition maps, as produced by listMyHivePartitions
 */
def getRecentPartitionDate(column: String, seqOfMap: Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]]): Option[Map[String, DateTime]] = {
  logger.info(" >>>>> column " + column)
  // Sort descending by the date value of the given column; the head is then the latest partition
  val mapWithMostRecentBusinessDate = seqOfMap.sortWith { (a, b) =>
    logger.debug(a(column).toString() + " col2 " + b(column).toString())
    a(column).isAfter(b(column))
  }
  logger.debug(s" mapWithMostRecentBusinessDate: $mapWithMostRecentBusinessDate , \n Head = ${mapWithMostRecentBusinessDate.headOption} ")
  mapWithMostRecentBusinessDate.headOption
}
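Putting the two helpers together, a minimal usage sketch could look like this (the table, column, and field names are taken from the question and are only illustrative):

// Hypothetical usage of the two helpers above
val parts: Seq[Map[String, DateTime]] = listMyHivePartitions("table", spark)
val latest: Option[Map[String, DateTime]] = getRecentPartitionDate("date_of", parts)

latest.foreach { p =>
  val maxDate = p("date_of").toString("yyyy-MM-dd")
  // The literal date lets Spark prune down to a single partition, as in the first plan above
  spark.sql(s"select field from table where date_of = '$maxDate'").show()
}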
The advantage: no SQL, no full table scan...
The same approach also works when you query the Hive metastore directly, since the database behind the metastore keeps the partition list for the table; in that case the query result comes back as a java.sql.ResultSet:
import java.sql.{ResultSet, Statement}

/**
 * showParts - runs "show partitions <table>" through a JDBC Statement and converts
 * the ResultSet into a sequence of partition maps.
 *
 * @param table  table whose partitions are listed
 * @param config application configuration (assumed to be e.g. com.typesafe.config.Config; unused here)
 * @param stmt   open java.sql.Statement
 */
def showParts(table: String, config: Config, stmt: Statement): Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]] = {
  val showPartitionsCmd = "show partitions " + table
  logger.info("showPartitionsCmd " + showPartitionsCmd)
  try {
    val resultSet = stmt.executeQuery(showPartitionsCmd)
    // checkData(resultSet)
    val result = resultToSeq(resultSet)
    logger.info(s"partitions of $table -> " + result.size)
    logger.debug("result " + result)
    result
  } catch {
    case e: Exception =>
      logger.error(s"Exception occurred while listing partitions of table $table..", e)
      Seq.empty // return an empty sequence instead of null on failure
  }
}
/**
 * resultToSeq - converts the "show partitions" ResultSet into a sequence of maps
 * (partition key -> partition value parsed as a Joda DateTime).
 *
 * @param queryResult ResultSet returned by the "show partitions" statement
 */
def resultToSeq(queryResult: ResultSet): Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]] = {
  val md = queryResult.getMetaData
  val colNames = for (i <- 1 to md.getColumnCount) yield md.getColumnName(i)
  var rows = Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]]()
  while (queryResult.next()) {
    var row = scala.collection.immutable.Map.empty[String, DateTime]
    for (n <- colNames) {
      // Each cell looks like "date_of=2019-06-23"; split into key and value
      val str = queryResult.getString(n).split("=")
      row += str(0) -> DateTime.parse(str(1))
      logger.debug(row.toString())
    }
    rows = rows :+ row
  }
  rows
}
Once you have the sequence of maps, apply the def shown at the top, i.e. getRecentPartitionDate.
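A sketch of how the JDBC variant could be wired up (the JDBC URL, credentials, and table name are placeholders, not part of the original answer, and a config value of the expected type is assumed to be in scope):

import java.sql.DriverManager

// Hypothetical connection to HiveServer2; adjust URL and credentials to your environment
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "password")
val stmt = conn.createStatement()

val parts  = showParts("table", config, stmt)           // Seq[Map[String, DateTime]]
val latest = getRecentPartitionDate("date_of", parts)   // Option[Map[String, DateTime]]
latest.foreach(p => println(s"Latest partition: ${p("date_of").toString("yyyy-MM-dd")}"))

stmt.close()
conn.close()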
Answer 1 (score: 1)
Building on Ram's answer, there is a much simpler way to accomplish this that eliminates a lot of overhead by querying the Hive metastore directly, rather than executing a Spark SQL query. No need to reinvent the wheel:
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.collection.JavaConverters._

// Build a HiveConf from the Hadoop configuration of the running SparkSession
val hiveConf = new HiveConf(spark.sparkContext.hadoopConfiguration, classOf[HiveConf])
val cli = new HiveMetaStoreClient(hiveConf)

// List all partitions of the table and take the lexicographically largest partition value,
// which is also the latest date for a yyyy-MM-dd formatted partition column
val maxPart = cli.listPartitions("<db_name>", "<tbl_name>", Short.MaxValue)
  .asScala
  .map(_.getValues.asScala.mkString(","))
  .max
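A minimal sketch of how the result could then be used (the table, column, and field names are placeholders taken from the question, not part of the original answer):

// maxPart now holds e.g. "2019-06-23"; plugging it in as a literal lets Spark prune to one partition
val df = spark.sql(s"select field from table where date_of = '$maxPart'")
df.explain(true) // PartitionFilters should now show the single-partition pruning from the first plan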