Question

I have some data partitioned this way:

/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1

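When using this data, I only want to load the last version of each month.

A simple way to do this would be to call load("/data/year=2016/month=11/version=3") instead of load("/data"). The drawback of this solution is that partition information such as year and month is lost, so it is no longer possible to apply operations based on the year or the month.

Is it possible to ask Spark to load only the last version of each month? How would you go about it?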

Answer 1

Just an addition to the previous answers.

I have the following ORC-format table in Hive, partitioned by the year, month, and day columns.

hive (default)> show partitions test_dev_db.partition_date_table;
OK
year=2019/month=08/day=07
year=2019/month=08/day=08
year=2019/month=08/day=09

With the following properties set, I can read a single partition's data in Spark SQL as shown below:

spark.sql("SET spark.sql.orc.enabled=true")
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SET spark.sql.orc.filterPushdown=true")

spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day='07' """).explain(True)

The plan shows PartitionCount: 1, so the scan was clearly pruned to that one partition.

== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#212,emp_name#213,emp_salary#214,emp_date#215,year#216,month#217,day#218] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(year#216), isnotnull(month#217), isnotnull(day#218), (year#216 = 2019), (month#217 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>

The same does not work with the query below, nor if we create the DataFrame with spark.read.format("orc").load(hdfs absolute path of table), register a temporary view, and run Spark SQL on it. Unless we apply a specific filter condition on top of it, Spark still scans all the available partitions of the table.

spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day in (select max(day) from test_dev_db.partition_date_table)""").explain(True)

It still scanned all three partitions; here PartitionCount is 3.

== Physical Plan ==
*(2) BroadcastHashJoin [day#282], [max(day)#291], LeftSemi, BuildRight
:- *(2) FileScan orc test_dev_db.partition_date_table[emp_id#276,emp_name#277,emp_salary#278,emp_date#279,year#280,month#281,day#282] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 3, PartitionFilters: [isnotnull(year#280), isnotnull(month#281), (year#280 = 2019), (month#281 = 08)], PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>

To filter the data by the maximum partition with Spark SQL, we can use the approach below. It prunes partitions, which limits the number of files and partitions Spark reads when querying Hive ORC table data.

# List the partitions as strings such as "year=2019/month=08/day=07"
rdd = spark.sql("""show partitions test_dev_db.partition_date_table""").rdd.flatMap(lambda x: x)
# Rewrite each string as "2019-08-07" and split it into [year, month, day]
newrdd = rdd.map(lambda x: x.replace("/", "")).map(lambda x: x.replace("year=", "")).map(lambda x: x.replace("month=", "-")).map(lambda x: x.replace("day=", "-")).map(lambda x: x.split('-'))
# Take the maximum of each component separately (values are compared as strings)
max_year = newrdd.map(lambda x: x[0]).max()
max_month = newrdd.map(lambda x: x[1]).max()
max_day = newrdd.map(lambda x: x[2]).max()

Then build the query that filters the Hive partitioned table using these maximum values.

The plan of that query shows that only one partition of the given Hive table is scanned; PartitionCount is 1.
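A minimal sketch of building the filter query from the output of show partitions, written in plain Python so it runs without a cluster. The helper names (parse_partition, latest_partition, build_latest_query) are hypothetical and not part of Spark. Unlike taking the maximum of year, month, and day separately, this variant picks the single latest (year, month, day) combination and compares the values as integers, which is an assumption about what "latest" should mean.

```python
def parse_partition(spec):
    """Parse 'year=2019/month=08/day=07' into {'year': '2019', 'month': '08', 'day': '07'}."""
    return dict(part.split("=", 1) for part in spec.split("/"))


def latest_partition(specs):
    """Return the partition with the greatest (year, month, day), compared numerically."""
    parsed = [parse_partition(s) for s in specs]
    return max(parsed, key=lambda p: (int(p["year"]), int(p["month"]), int(p["day"])))


def build_latest_query(table, specs):
    """Build a query with literal partition filters so Spark can prune partitions."""
    p = latest_partition(specs)
    return (
        "select * from {0} where year = '{1}' and month = '{2}' and day = '{3}'"
        .format(table, p["year"], p["month"], p["day"])
    )


# Partition strings as listed by `show partitions` in the answer above.
specs = [
    "year=2019/month=08/day=07",
    "year=2019/month=08/day=08",
    "year=2019/month=08/day=09",
]
query = build_latest_query("test_dev_db.partition_date_table", specs)
print(query)

# With a live SparkSession named `spark` (assumed, not created here), the
# strings could come from
#   [row[0] for row in spark.sql("show partitions " + table).collect()]
# and the plan could be inspected with spark.sql(query).explain(True).
```

Because the partition values end up as literals in the where clause, the query has the same shape as the first query in this answer, the one whose plan reported PartitionCount: 1.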

Answer 2

Well, Spark supports predicate pushdown, so if you apply a filter right after load, it only reads the data that satisfies the filter. Like this:

spark.read.option("basePath", "/data").load("/data").filter('version === 3)

And you get to keep the partition information :)
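Note that a fixed filter on version 3 keeps only the partitions whose version is exactly 3, which in the question's layout exists only for month 10. A minimal sketch of one alternative, assuming the partition directories have already been listed by some other means (the listing step is not shown): pick the highest version per (year, month) in plain Python and pass only those paths to load together with the basePath option. The helper name latest_version_paths is hypothetical.

```python
import re

# Matches the trailing year=/month=/version= components of a partition path.
PARTITION_RE = re.compile(r"year=(\d+)/month=(\d+)/version=(\d+)$")


def latest_version_paths(paths):
    """Return, for each (year, month), the path with the highest version number."""
    latest = {}
    for path in paths:
        match = PARTITION_RE.search(path.rstrip("/"))
        if match is None:
            continue  # skip anything that is not a version directory
        year, month, version = (int(g) for g in match.groups())
        key = (year, month)
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, path)
    return [latest[key][1] for key in sorted(latest)]


# The directory layout from the question.
paths = [
    "/data/year=2016/month=9/version=0",
    "/data/year=2016/month=10/version=0",
    "/data/year=2016/month=10/version=1",
    "/data/year=2016/month=10/version=2",
    "/data/year=2016/month=10/version=3",
    "/data/year=2016/month=11/version=0",
    "/data/year=2016/month=11/version=1",
]
selected = latest_version_paths(paths)
print(selected)

# With a live SparkSession named `spark` (assumed, not created here):
#   df = spark.read.option("basePath", "/data").load(selected)
# basePath tells Spark where partition discovery starts, so year, month,
# and version remain available as columns.
```

Versions are compared as integers, so version=10 would rank above version=9, which a string comparison would get wrong.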

Answer 3

I think you have to use Spark's window functions, then find and filter for the latest version.

import org.apache.spark.sql.functions.{col, first}
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("year","month").orderBy(col("version").desc)

spark.read.load("/data")
  .withColumn("maxVersion", first("version").over(windowSpec))
  .select("*")
  .filter(col("maxVersion") === col("version"))
  .drop("maxVersion")

Let me know if this works for you.

How to load only the last partition
