The parallelize() method in SparkContext

Date: 2015-11-18 19:24:09

Tags: apache-spark

I want to understand the effect of providing different values of numSlices to the parallelize() method of SparkContext. The signature of the method is given below:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]

I ran spark-shell in local mode:

spark-shell --master local
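
Note that, per the signature above, numSlices defaults to sc.defaultParallelism. Under --master local (a single worker thread) and with spark.default.parallelism unset, that value is typically 1, though this depends on the session's configuration; it can be checked directly:

scala> sc.defaultParallelism   // default for numSlices; typically 1 under --master local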

My understanding is that numSlices determines the number of partitions of the resulting RDD (after calling sc.parallelize()). Consider the following examples:

Case 1

scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1

Case 2

scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2

Case 3

scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2

Case 4

scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2

Question 1: In Case 3 and Case 4, I expected the partition sizes to be 3 and 4 respectively, yet in both cases the size is only 2. What is the reason for this?

Question 2: In each case there is a number associated with ParallelCollectionRDD[no], i.e. in Case 1 it is ParallelCollectionRDD[0], in Case 2 it is ParallelCollectionRDD[1], and so on. What exactly do these numbers mean?

1 Answer:

Answer 0: (score: 22)

Question 1: This is a typo on your part. You called res3.partitions.size instead of res5 and res7, respectively. When I do it with the correct numbers, it works as expected.
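
For completeness, here is what the corrected calls return against the RDDs created in Case 3 and Case 4 (the exact resN names depend on the REPL session; the partition counts do not):

scala> res5.partitions.size
res9: Int = 3
scala> res7.partitions.size
res10: Int = 4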

Question 2: This is the id of the RDD within the Spark context, used to keep the lineage graph straight. Look at what happens when I run the same command three times:

scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

Now there are three different RDDs with three different ids. We can run the following to check:

scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
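
As a further illustration (a sketch continuing the same session; the exact resN name and console line may differ), the counter is shared by every RDD the context creates, not only those from parallelize, so a transformation on an existing RDD claims the next id:

scala> res0.map(_ * 2)   // a new RDD is created and receives the next id, 3
res4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:24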