apache-spark - DataFrame partition pruning

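I have two Hive tables (table_a and table_b). Both are partitioned by build_date (an integer in YYYYMMDD format) and stored as Snappy-compressed Parquet. I use these tables in my PySpark program as follows: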

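from pyspark.sql import HiveContext
from pyspark.sql import functions as func

hiveCtx = HiveContext(sc)  # sc is an already created SparkContext
df1 = hiveCtx.table('table_a').filter(func.col('build_date') == 20170101)
df2 = hiveCtx.table('table_b')
df3 = df1.join(df2, df1.build_date == df2.build_date)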

When I run df3.explain(), the plan shows that all partitions of table_b are read. How can I make it read only the specific partition?

I have also set the Hive property: set('spark.sql.hive.convertMetastoreParquet', 'false')
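One possible workaround, sketched here rather than confirmed for this setup: apply the same build_date filter to both tables before the join, so that each table scan carries its own partition predicate instead of relying on the optimizer to infer it from the join condition. Whether the optimizer infers it on its own may depend on the Spark version. The helper name join_single_partition and its partition_col parameter are hypothetical; the sketch only uses DataFrame.filter with a SQL expression string and DataFrame.join with a column name.

```python
def join_single_partition(df_a, df_b, build_date, partition_col="build_date"):
    """Join two DataFrames after restricting BOTH to a single partition value.

    Hypothetical helper: the name and parameters are illustrative and do not
    come from the original question.
    """
    # Build one SQL predicate on the partition column, e.g. "build_date = 20170101".
    predicate = "{} = {}".format(partition_col, int(build_date))

    # Filter each side explicitly so the predicate is attached to each table scan.
    left = df_a.filter(predicate)
    right = df_b.filter(predicate)

    # Equi-join on the partition column name.
    return left.join(right, on=partition_col)


# Usage with the tables from the question (assumes hiveCtx already exists):
# df3 = join_single_partition(hiveCtx.table('table_a'),
#                             hiveCtx.table('table_b'),
#                             20170101)
# df3.explain()  # check which partitions of table_b appear in the plan
```

Note that joining on the column name yields a single build_date column in the result, whereas the join condition df1.build_date == df2.build_date in the question keeps both. Checking the output of df3.explain() is still needed to confirm that only the intended partition is scanned.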


0 answers: