Below, I get the same number of partitions (200) from all four print statements. The initial DataFrame (df1) is partitioned on four columns (account_id, schema_name, table_name, column_name), but the subsequent DataFrames are partitioned on only three fields (account_id, schema_name, table_name). Can someone explain to me whether Spark can preserve the partitioning strategy across steps 1-4, so that the data would not need to be shuffled again after step 1?
val query1: String =
  """SELECT account_id, schema_name, table_name, column_name,
    |       COLLECT_SET(u.query_id) AS query_id_set
    |FROM usage_tab u
    |GROUP BY account_id, schema_name, table_name, column_name""".stripMargin
val df1 = session.sql(query1)
println("1 " + df.rdd.getNumPartitions)
df1.createOrReplaceTempView("wtftempusage")
val query2 = "SELECT DISTINCT account_id, schema_name, table_name
FROM wtftempusage"
val df2 = session.sql(query2)
println("2 " + df2.rdd.getNumPartitions)
//MyFuncIterator retains all columns for df2 and adds an additional column
val extendedDF = df2.mapPartitions(MyFuncIterator)
println("3 " + extendedDF.rdd.getNumPartitions)
val joinedDF = df1.join(extendedDF, Seq("account_id", "schema_name", "table_name"))
println("4 " + joinedDF.rdd.getNumPartitions)
Thanks, Devj
Answer 0 (score: 0)
The default number of shuffle partitions in the DataFrame API is 200.
You can set spark.sql.shuffle.partitions to a smaller number, for example:
sqlContext.setConf("spark.sql.shuffle.partitions", "5")
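For example, a minimal sketch (assuming the same SparkSession named session as in your question; session.conf.set is the newer equivalent of sqlContext.setConf):

// Lower the shuffle-partition count before running any query that shuffles.
session.conf.set("spark.sql.shuffle.partitions", "5")

// Every shuffle (the GROUP BY, the DISTINCT, and the join) now produces 5 partitions
// instead of 200, so all four print statements in the question would report 5.
val df1 = session.sql(query1)
println("1 " + df1.rdd.getNumPartitions)  // 5

Note that the GROUP BY in step 1, the DISTINCT in step 2, and the join in step 4 each shuffle into spark.sql.shuffle.partitions partitions, while mapPartitions in step 3 simply preserves the partition count of df2 — which is why all four prints show 200 under the default setting.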