I am trying to maximize parallelism by distributing an RDD across as many executors as possible. As far as I know, a user can change the number of partitions with repartition, coalesce, or parallelize, but I cannot find a way to change the number of executors that hold those partitions. Can anyone hint at how to do this?
Answer 0 (score: 2)
When you launch your Spark application, there is a parameter --num-executors to specify how many executors you want, and, alongside it, --executor-cores to specify how many tasks can run in parallel inside each executor.
In your case, you can request a large number of executors with only 1 executor core each. Then, for example, with 10 partitions and 10 executors, each executor will be assigned one task that processes one partition, as sketched below.
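For illustration only, a minimal submission along those lines might look like the following (the master URL, main class, and jar name are placeholders, not taken from the answer):

    spark-submit \
      --master yarn \
      --num-executors 10 \
      --executor-cores 1 \
      --class com.example.MyApp \
      my-app.jar

Combined with an RDD repartitioned into 10 partitions, each of the 10 executors would then process exactly one partition.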
Answer 1 (score: 0)
Type spark-submit directly on the command line and you will get its help text. It contains the following:
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.

It depends on your deploy mode. You must submit your script to Spark with the specific parameters above to set the number of executors accordingly; a sketch for YARN cluster mode follows.
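As an illustration under assumed values (the script name and memory size are placeholders), a YARN cluster-mode submission could look like this:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 1 \
      --executor-memory 2g \
      my_job.py

On a standalone or Mesos cluster, where --num-executors is not listed in the help above, the corresponding knobs are --total-executor-cores together with --executor-cores.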