Why does Spark create empty partitions, and how does default partitioning work?

Date: 2018-01-18 05:15:58

Tags: apache-spark rdd partitioning

I am creating an RDD from a text file by specifying the number of partitions, but it gives me a different number of partitions than the one I specify.

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27 
scala> people.getNumPartitions 
res47: Int = 1 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27 
scala> people.getNumPartitions 
res36: Int = 1 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27 
scala> people.getNumPartitions 
res37: Int = 2 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27 
scala> people.getNumPartitions 
res38: Int = 3 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27 
scala> people.getNumPartitions 
res39: Int = 4 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27 
scala> people.getNumPartitions 
res40: Int = 6 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27 
scala> people.getNumPartitions 
res41: Int = 7 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27 
scala> people.getNumPartitions 
res42: Int = 8 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27 
scala> people.getNumPartitions 
res43: Int = 9 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27 
scala> people.getNumPartitions 
res44: Int = 11 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27 
scala> people.getNumPartitions 
res45: Int = 11 

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11) 
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27 
scala> people.getNumPartitions 
res46: Int = 13

The contents of the file /home/pvikash/data/test.txt are:

This is a test file. 
Will be used for rdd partition.

I am trying to understand why the number of partitions changes here, and, given that the data is small (small enough to fit into a single partition), why Spark creates empty partitions.

Any explanation would be appreciated.

1 Answer:

Answer 0 (score: 1)

In Spark, the textFile function calls the hadoopFile function.
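
Roughly, and simplified from Spark's SparkContext (scoping and bookkeeping omitted), textFile just forwards minPartitions to hadoopFile and keeps the line text from each (offset, line) pair. The following is a sketch, not the exact source:

// Simplified sketch of SparkContext.textFile, not the exact implementation.
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] =
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)
    .map(pair => pair._2.toString)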

If you check what the signature of hadoopFile looks like:

def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = {

So the number of partitions you specify is the minimum number of partitions the RDD will have. The size of each split, however, is determined by a separate function in the file input format, computeSplitSize.
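
To make this concrete, below is a rough Scala sketch of the split computation done by Hadoop's old-API FileInputFormat (the input format behind textFile). The names computeSplitSize and SPLIT_SLOP mirror the Hadoop code, but the body is a simplified illustration under assumed defaults (minSize = 1, a large block size), not the real implementation:

// Simplified sketch of how FileInputFormat sizes splits for a single file.
object SplitSketch {
  // Hadoop tolerates a last chunk up to 10% larger than splitSize before splitting it again.
  val SPLIT_SLOP = 1.1

  // splitSize = max(minSize, min(goalSize, blockSize))
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Estimate how many splits (and hence partitions) a file of fileSize bytes yields
  // when minPartitions is requested. blockSize here is an assumed value for illustration.
  def numSplits(fileSize: Long,
                minPartitions: Int,
                minSize: Long = 1L,
                blockSize: Long = 32L * 1024 * 1024): Int = {
    val goalSize  = fileSize / math.max(minPartitions, 1) // target bytes per split
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    var remaining = fileSize
    var splits    = 0
    while (remaining.toDouble / splitSize > SPLIT_SLOP) {
      splits += 1
      remaining -= splitSize
    }
    if (remaining > 0) splits += 1 // leftover bytes become one final, smaller split
    splits
  }
}

Because goalSize is the file size divided by the requested number of partitions, and any leftover bytes beyond the last full split become an extra split, a small file can easily end up with more splits than the minPartitions you asked for.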

So when you set the parallelism, you are guaranteed to get at least that many partitions, but the exact number can be larger than what you asked for.
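
As a hypothetical check against the output above (assuming test.txt is 52 bytes: 21 bytes for the first line including its newline and 31 for the second line without one), the sketch reproduces the observed counts:

// 52 bytes is an assumption about the file shown in the question.
(1 to 11).foreach { n =>
  println(s"minPartitions = $n -> splits = ${SplitSketch.numSplits(52L, n)}")
}
// For example, minPartitions = 5 gives 6 splits (five 10-byte splits plus a 2-byte leftover),
// matching the 6 partitions reported by getNumPartitions above.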

There is a good blog post related to this.