I want to understand how Spark computes the number of partitions when joining dataframes. I am using Spark 1.6.2 with YARN and Hadoop.
I have the following code:
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.storage.StorageLevel

val df1 = .....                 // larger dataframe
val df2 = .... .cache()         // small cached dataframe
// cartesian join (no join condition, df2 marked for broadcast)
val joined = df1.join(broadcast(df2)).persist(StorageLevel.MEMORY_AND_DISK_SER)

println(df1.rdd.partitions.size)    // prints 10
println(df2.rdd.partitions.size)    // prints 28
println(joined.rdd.partitions.size) // prints 33
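The only other number I could think to compare against is the cluster's default parallelism. A minimal sketch of that check, assuming the SparkContext is in scope as sc (the value it printed in my run is not recorded here):

// getNumPartitions is an alias for partitions.size in Spark 1.6+.
println(joined.rdd.getNumPartitions) // 33, same as above
// Default parallelism of the cluster, for comparison only; its value
// in my run is not recorded in this post.
println(sc.defaultParallelism)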
Can someone explain why the result is 33?

Edit: here are the optimized logical and physical plans of the joined dataframe.
== Optimized Logical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- Join Inner, None
      :- InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- BroadcastHint
         +- InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None

== Physical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- BroadcastNestedLoopJoin BuildRight, Inner, None
      :- InMemoryColumnarTableScan [key1_type#2,key1#6L,key2#9,key3#21L], InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- InMemoryColumnarTableScan [temp_index#33L], InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None
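For reference, the plans above can be reproduced along these lines (a minimal sketch; explain(true) prints the extended parsed, analyzed, optimized, and physical plans in Spark 1.6):

// Print all query plans for the join.
joined.explain(true)

// Or pull individual plans out programmatically:
println(joined.queryExecution.optimizedPlan)
println(joined.queryExecution.executedPlan)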