Confused about Spark's partitioning strategy for DataFrames

Date: 2017-09-13 07:49:46

Tags: apache-spark apache-spark-sql

In the code below I get the same number of partitions (200) from all four print statements. The initial DataFrame (df1) is partitioned on four columns (account_id, schema_name, table_name, column_name), while the subsequent DataFrames are partitioned on only three of those fields (account_id, schema_name, table_name). Can someone explain to me whether Spark is able to retain the partitioning scheme from step 1 through step 4, so that the data would not need to be shuffled again after step 1?

val query1: String = "SELECT account_id, schema_name, table_name, 
column_name, COLLECT_SET(u.query_id) AS query_id_set FROM usage_tab 
GROUP BY account_id, schema_name, table_name, column_name"
val df1 = session.sql(query1)
println("1 " + df.rdd.getNumPartitions)


df1.createOrReplaceTempView("wtftempusage")
val query2 = "SELECT DISTINCT account_id, schema_name, table_name 
FROM wtftempusage"
val df2 = session.sql(query2)
println("2 " + df2.rdd.getNumPartitions)


//MyFuncIterator retains all columns for df2 and adds an additional column
val extendedDF = df2.mapPartitions(MyFuncIterator)
println("3 " + extendedDF.rdd.getNumPartitions)


val joinedDF = df1.join(extendedDF, Seq("account_id", "schema_name", "table_name"))
println("4 " + joinedDF.rdd.getNumPartitions)

Thanks, Devj

1 Answer:

Answer 0 (score: 0)

The default number of shuffle partitions in the DataFrame API is 200. Each of the shuffling steps above (the GROUP BY, the DISTINCT, and the join) produces its own shuffle stage, and every shuffle uses that default, which is why all four prints show 200.

You can set the default spark.sql.shuffle.partitions to a smaller number, for example: sqlContext.setConf("spark.sql.shuffle.partitions", "5")
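
For context, a minimal sketch of how that setting affects the partition counts, reusing the session and the usage_tab table from the question (the COUNT(*) aggregate here is just an illustrative placeholder):

// Lower the shuffle-partition default before running the queries.
// On Spark 2.x the setting can be applied via session.conf;
// sqlContext.setConf(...) as shown above works as well.
session.conf.set("spark.sql.shuffle.partitions", "5")

// Any shuffle-producing operation (GROUP BY, DISTINCT, join) now
// yields 5 output partitions instead of the default 200.
val grouped = session.sql(
  """SELECT account_id, schema_name, table_name, COUNT(*) AS cnt
    |  FROM usage_tab
    | GROUP BY account_id, schema_name, table_name""".stripMargin)
println(grouped.rdd.getNumPartitions)  // expected: 5

Note that this only changes how many partitions each shuffle produces; it does not remove the shuffles themselves.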