I want to understand the effect of providing different values of numSlices to SparkContext's parallelize() method. The method's signature is given below:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
I ran spark-shell in local mode: spark-shell --master local
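As a side note, since numSlices defaults to defaultParallelism, omitting it should yield that many partitions; with a plain --master local (one worker thread) the default is typically 1. A quick check (output assumed for a single-threaded local master; the resN counters are from a separate session):
scala> sc.defaultParallelism
res0: Int = 1
scala> sc.parallelize(1 to 9).partitions.size
res1: Int = 1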
My understanding is that numSlices determines the number of partitions of the resulting RDD (after calling sc.parallelize()). Consider the following cases:
Case 1
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Case 2
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Case 3
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Case 4
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Question 1: In Case 3 and Case 4, I expected the partition sizes to be 3 and 4 respectively, but in both cases the size is only 2. What is the reason for this?
Question 2: In each case there is a number associated with ParallelCollectionRDD[no], i.e. in Case 1 it is ParallelCollectionRDD[0], in Case 2 it is ParallelCollectionRDD[1], and so on. What exactly do these numbers signify?
Answer (score: 22)
Question 1: That is a typo on your part. You called res3.partitions.size instead of res5 and res7, respectively. When I run it with the correct numbers, it works as expected.
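For reference, the corrected calls should report 3 and 4 partitions for Case 3 and Case 4. Roughly (the resN counters here are illustrative and depend on the session):
scala> res5.partitions.size
res9: Int = 3
scala> res7.partitions.size
res10: Int = 4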
Question 2: That is the id of the RDD within the Spark context, used to keep the lineage graph straight. Watch what happens when I run the same command three times:
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
Now there are three different RDDs with three different ids. We can run the following to check:
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
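The counter behind these ids is shared by every RDD created in the context, not only parallelized collections: a transformation registers a new RDD and receives the next id. A sketch continuing the session above (the exact RDD class name and console line number vary by Spark version):
scala> res0.map(_ * 2)
res4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:23
scala> res4.id
res5: Int = 3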