Question

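My /accounts/* directory has 7 files, each smaller than the block size.

I want to understand how Spark computes the number of partitions. The second argument to the "textFile" method is a hint to Spark for the number of partitions, but is there some logic by which it decides the actual number based on that hint?

With 10 as the input it gives 15 partitions, and with 20 as the input it gives 25 partitions.

How is this calculated?

Regards!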

scala> var accounts= sc.textFile("/accounts/*",3)

scala> accounts.toDebugString
15/10/12 02:41:45 INFO mapred.FileInputFormat: Total input paths to  process : 7
res0: String = 
(7) /accounts/* MapPartitionsRDD[1] at textFile at <console>:21 []
 |  /accounts/* HadoopRDD[0] at textFile at <console>:21 []

scala> var accounts= sc.textFile("/accounts/*",10)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String = 
 (15) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
 |   /accounts/* HadoopRDD[2] at textFile at <console>:21 []

scala> var accounts= sc.textFile("/accounts/*",20)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String = 
 (23) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
 |   /accounts/* HadoopRDD[2] at textFile at <console>:21 []

Answer 1

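Spark does not compute the number of partitions itself. It simply passes the hint to the Hadoop library. What does Hadoop do with it? That depends. Check the documentation (or, more likely, the code) for the getSplits method of the specific InputFormat.

For example, for TextInputFormat the code lives in FileInputFormat.getSplits. It is fairly complex and depends on several configuration parameters.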
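
To make the idea concrete, here is a rough sketch of how that split calculation appears to work in the old-API org.apache.hadoop.mapred.FileInputFormat, as far as I understand it. It is a simplified model, not the real Hadoop code: the object name SplitEstimate and its methods are made up for illustration, the file sizes are hypothetical, and it assumes every file is splittable and all files share one block size.

```scala
// Simplified model of the split loop in the old-API
// org.apache.hadoop.mapred.FileInputFormat.getSplits.
// This is an approximation for illustration, not the actual Hadoop source.
object SplitEstimate {
  // A trailing chunk up to 10% larger than splitSize is kept as one split.
  val SplitSlop = 1.1

  // Number of splits produced for a single file of the given length.
  def splitsForFile(length: Long, splitSize: Long): Int =
    if (length == 0) 1 // an empty file still yields one (empty) split
    else {
      var remaining = length
      var count = 0
      while (remaining.toDouble / splitSize > SplitSlop) {
        count += 1
        remaining -= splitSize
      }
      if (remaining != 0) count += 1 // leftover bytes form the last split
      count
    }

  // numSplits is the hint (the second argument of textFile).
  // minSize stands in for the configured minimum split size (assumed 1 here).
  def estimate(fileSizes: Seq[Long], numSplits: Int,
               blockSize: Long, minSize: Long = 1L): Int = {
    val totalSize = fileSizes.sum
    val goalSize  = totalSize / math.max(numSplits, 1)
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    fileSizes.map(splitsForFile(_, splitSize)).sum
  }
}
```

With seven hypothetical 100-byte files and a 128 MB block size, SplitEstimate.estimate(Seq.fill(7)(100L), 10, 128L * 1024 * 1024) gives 14, because the goal size is 700 / 10 = 70 bytes and each file is cut into a 70-byte and a 30-byte split. The point is that the hint sets a target split size across the total input, and each file is then split independently, so the final count can exceed the hint. The real sizes of the files in the question are not known, so this does not attempt to reproduce the exact numbers reported there.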

Answer 2

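Generally, when reading from HDFS, Spark creates an RDD with one partition per HDFS block.

More details on how partitions flow through the pipeline, and what you can tune, are available here.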

How does Spark calculate the number of partitions?
