Consider the following simple example, run in a Spark shell connected to a cluster with 4 executors:
scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd.count()
res0: Long = 6
scala> val singlePartition = rdd.repartition(1).cache.setName("singlePartition")
singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition MapPartitionsRDD[4] at repartition at <console>:29
scala> singlePartition.count()
res1: Long = 6
scala> val multiplePartitions = singlePartition.repartition(6).cache.setName("multiplePartitions")
multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions MapPartitionsRDD[8] at repartition at <console>:31
scala> multiplePartitions.count()
res2: Long = 6
The original rdd has 4 partitions and, when I check in the UI, it is spread across the 4 executors. The singlePartition RDD, unsurprisingly, lives entirely on a single executor. When the multiplePartitions RDD is created by repartitioning singlePartition, I would expect the data to be shuffled across the 4 executors. What I actually see is that multiplePartitions does have 6 partitions, but they all sit on a single executor, the same one that holds the singlePartition partition.
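To double-check the placement without relying only on the UI, I could run something like the following in the same shell (a rough sketch; the hostname lookup runs inside each task, so it reports the host of the executor that ends up holding each partition):

multiplePartitions
  .mapPartitionsWithIndex { (idx, iter) =>
    // executed on the executor, so getLocalHost reports that executor's host
    val host = java.net.InetAddress.getLocalHost.getHostName
    Iterator((idx, host, iter.size))
  }
  .collect()
  .sortBy(_._1)
  .foreach { case (idx, host, n) =>
    println(s"partition $idx -> $host ($n elements)")
  }

All 6 partitions report the same host with this check as well.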
Shouldn't the repartition shuffle the data across all 4 executors?