Spark: how to specify the number of executors that hold an RDD?

Posted: 2014-08-31 07:00:59

Tags: apache-spark rdd

I am trying to maximize parallelism by spreading an RDD across as many executors as possible. As far as I know, a user can change the number of partitions with repartition, coalesce, or parallelize, but I cannot find a way to change the number of executors that hold those partitions. Can anyone give a hint on how to do this?
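
For context, this is the partition-level control I already know about (a minimal Scala sketch, assuming an existing SparkContext named sc); it changes the number of partitions, not the number of executors that end up holding them:

  val rdd = sc.parallelize(1 to 1000, 10)   // create an RDD with 10 partitions
  val more = rdd.repartition(20)            // shuffle the data into 20 partitions
  val fewer = rdd.coalesce(5)               // merge down to 5 partitions without a shuffle
  println(rdd.partitions.length)            // inspect the current partition count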

2 Answers:

Answer 0 (score: 2)

When you launch your Spark application, there is a parameter --num-executors to specify how many executors you want, and alongside it, --executor-cores specifies how many tasks can run in parallel inside each executor.

In your case, you can ask for a large number of executors with only 1 executor-core each. Then, for example, if you have 10 partitions and 10 executors, each executor will be assigned one task that processes one partition.
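
If you would rather set this in code than on the spark-submit command line, a rough programmatic equivalent is the sketch below (the app name is made up, and spark.executor.instances is only honored when running on YARN):

  import org.apache.spark.{SparkConf, SparkContext}

  // Ask for 10 executors with 1 core each, i.e. one concurrent task per
  // executor (same effect as --num-executors 10 --executor-cores 1)
  val conf = new SparkConf()
    .setAppName("many-small-executors")      // hypothetical application name
    .set("spark.executor.instances", "10")   // corresponds to --num-executors (YARN)
    .set("spark.executor.cores", "1")        // corresponds to --executor-cores
  val sc = new SparkContext(conf)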

Answer 1 (score: 0)

Type spark-submit directly at the command line and you will get its usage help. It contains the following:

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

It depends on your deploy mode. You have to submit your script with spark-submit using the relevant parameters above to set the number of executors accordingly.