I want to understand how Spark computes the number of partitions when joining dataframes. I am using Spark 1.6.2 with YARN and Hadoop.
I have the following code:
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.storage.StorageLevel

val df1 = .....                 // larger dataframe
val df2 = .... .cache()         // small cached dataframe
// cartesian join (no join condition, df2 marked for broadcast)
val joined = df1.join(broadcast(df2)).persist(StorageLevel.MEMORY_AND_DISK_SER)

println(df1.rdd.partitions.size)    // prints 10
println(df2.rdd.partitions.size)    // prints 28
println(joined.rdd.partitions.size) // prints 33
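The only other number I could think to compare against is the cluster's default parallelism. A minimal sketch of that check, assuming the SparkContext is in scope as sc (the value it printed in my run is not recorded here):

// getNumPartitions is an alias for partitions.size in Spark 1.6+.
println(joined.rdd.getNumPartitions) // 33, same as above
// Default parallelism of the cluster, for comparison only; its value
// in my run is not recorded in this post.
println(sc.defaultParallelism)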
Can someone explain why the result is 33?

Edit: here are the optimized logical and physical plans of the joined dataframe.
== Optimized Logical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- Join Inner, None
      :- InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- BroadcastHint
         +- InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None

== Physical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- BroadcastNestedLoopJoin BuildRight, Inner, None
      :- InMemoryColumnarTableScan [key1_type#2,key1#6L,key2#9,key3#21L], InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- InMemoryColumnarTableScan [temp_index#33L], InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None
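For reference, the plans above can be reproduced along these lines (a minimal sketch; explain(true) prints the extended parsed, analyzed, optimized, and physical plans in Spark 1.6):

// Print all query plans for the join.
joined.explain(true)

// Or pull individual plans out programmatically:
println(joined.queryExecution.optimizedPlan)
println(joined.queryExecution.executedPlan)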