Doesn't repartitioning move data to all nodes?

Asked: 2016-04-14 13:26:09

Tags: apache-spark

Consider the following simple example, run in a Spark shell connected to a cluster with 4 executors:

scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at parallelize at <console>:27

scala> rdd.count()
res0: Long = 6

[Spark UI screenshot: storage for "rdd"]
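
As a side check, the executor count can be confirmed from the same shell; this is a sketch, not part of the original session. SparkContext.getExecutorMemoryStatus returns one entry per block manager, i.e. the executors plus the driver:

// Lists one "host:port" key per block manager (4 executors + the driver here).
sc.getExecutorMemoryStatus.keys.foreach(println)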

scala> val singlePartition = rdd.repartition(1).cache.setName("singlePartition")
singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition MapPartitionsRDD[4] at repartition at <console>:29

scala> singlePartition.count()
res1: Long = 6

[Spark UI screenshot: storage for "singlePartition"]

scala> val multiplePartitions = singlePartition.repartition(6).cache.setName("multiplePartitions")
multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions MapPartitionsRDD[8] at repartition at <console>:31

scala> multiplePartitions.count()
res2: Long = 6

[Spark UI screenshot: storage for "multiplePartitions"]

The original rdd has 4 partitions, and when I check the Spark UI it is spread across the 4 executors. The singlePartition RDD is, as expected, held entirely on one executor. When the multiplePartitions RDD is created by repartitioning singlePartition, I expect the data to be shuffled across the 4 executors. What I actually see is that multiplePartitions does have 6 partitions, but they all sit on a single executor, the same one that holds singlePartition's one partition.
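
The same placement can also be checked from the shell instead of the UI. Here is a minimal sketch, assuming the session above, that tags each partition with the host it is evaluated on (for a cached RDD, that is the host holding the block; it distinguishes hosts, not individual executors on one host):

// Report, for each partition, the host it is computed on and its element count.
multiplePartitions
  .mapPartitionsWithIndex { (idx, iter) =>
    val host = java.net.InetAddress.getLocalHost.getHostName
    Iterator((idx, host, iter.size))
  }
  .collect()
  .foreach { case (idx, host, n) => println(s"partition $idx on $host: $n elements") }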

Shouldn't repartitioning shuffle the data across the 4 executors?
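
For reference, the lineage confirms that repartition does go through a shuffle (repartition is a shuffled coalesce), so the surprise is purely about where the post-shuffle partitions end up. A quick check in the same session:

// Prints the RDD lineage; a ShuffledRDD should show up in it.
println(multiplePartitions.toDebugString)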

0 Answers:

No answers yet.