How do I specify Spark configuration when running on EMR?

Date: 2019-07-02 19:06:23

Tags: amazon-web-services apache-spark amazon-emr aws-step-config

I'm trying to run a Spark pipeline on EMR, and I'm creating a step like this:

// Build the Spark job submission request
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar(jarS3Path)
      .withMainClass("com.example.SparkApp")
  )

The problem is that when I run it, I hit an exception like this:

org.apache.spark.SparkException: A master URL must be set in your configuration

The issue is that I'm trying to figure out where to specify the master URL, but I can't seem to find it. Do I specify it when setting up the pipeline step, or do I need to get the master IP:port into the application somehow and set it in the main function?

2 answers:

Answer 0 (score: 1)

You should specify it in your application when you create the SparkSession instance.

Example for running locally (Scala code):

import org.apache.spark.sql.SparkSession

val sparkSessionBuilder = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .master("local[*]")                        // run locally on all available cores
  .config("spark.driver.host", "localhost")  // avoid hostname-resolution issues locally

You can find more information at jaceklaskowski.gitbooks.io or at spark.apache.org.
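Note that hardcoding local[*] only works for local runs. One way to keep a single build that runs both locally and on the cluster (a sketch, not part of the original answer) is to set a local master only when none has been supplied externally, e.g. by spark-submit:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: fall back to local[*] only when no master was supplied
// externally (spark-submit sets spark.* system properties for us).
// The app name here is illustrative.
val conf = new SparkConf() // picks up spark.* system properties
val builder = SparkSession.builder().appName("SparkApp").config(conf)
val spark =
  if (conf.contains("spark.master")) builder.getOrCreate()
  else builder.master("local[*]").getOrCreate()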

When launching the cluster, you should define the step using command-runner.jar and pass spark-submit arguments that point at your jar:

val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("command-runner.jar")  // EMR helper jar that runs a command on the master node
      .withArgs("spark-submit",
        "--deploy-mode", "cluster",
        "--driver-memory", "10G",
        "--class", "<your_class_to_run>",
        "s3://path_to_your_jar"))

See the AWS documentation section "To submit work to Spark using the SDK for Java".
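The snippet above only builds the StepConfig; for completeness, here is a minimal sketch of actually submitting it to an already-running cluster with the AWS SDK for Java v1. The region and cluster ID are placeholders:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest

// Sketch: attach the step defined above to an existing cluster
val emr = AmazonElasticMapReduceClientBuilder.standard()
  .withRegion("us-east-1")            // placeholder region
  .build()

val result = emr.addJobFlowSteps(
  new AddJobFlowStepsRequest()
    .withJobFlowId("j-XXXXXXXXXXXXX") // placeholder cluster ID
    .withSteps(runSparkJob)
)
println(result.getStepIds)            // IDs of the newly added steps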

Answer 1 (score: 1)

In your Spark application you can do the following... this is option 1:

import org.apache.spark.sql.SparkSession

val sparkSessionBuilder = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .master("yarn")  // on EMR, "yarn" is resolved from the cluster's Hadoop configuration
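For completeness, a minimal sketch of finishing the builder and shutting the session down; the workload here is only a placeholder:

// Sketch: build the session from the builder above and use it
val spark = sparkSessionBuilder.getOrCreate()
try {
  spark.range(10).count() // placeholder workload
} finally {
  spark.stop()
}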

If you want to add it to the StepConfig instead... this is option 2:

// Define the Spark application step
HadoopJarStepConfig sparkConfig = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit", "--deploy-mode", "cluster", "--master", "yarn",
              "--class", "com.amazonaws.samples.TestQuery",
              "s3://20180205-kh-emr-01/jar/emrtest.jar", "10", "Step Test"); // optional trailing application arguments

StepConfig customStep = new StepConfig()
    .withHadoopJarStep(sparkConfig)
    .withName("SparkSQL");
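Since the question title asks about Spark configuration in general: arbitrary Spark properties can be passed in the same argument list with spark-submit's --conf flag. A sketch in the Scala style of the first answer; the property values are illustrative only:

val sparkStepWithConf = new HadoopJarStepConfig()
  .withJar("command-runner.jar")
  .withArgs("spark-submit",
    "--deploy-mode", "cluster",
    "--master", "yarn",
    "--conf", "spark.executor.memory=4g",          // illustrative value
    "--conf", "spark.sql.shuffle.partitions=200",  // illustrative value
    "--class", "com.amazonaws.samples.TestQuery",
    "s3://20180205-kh-emr-01/jar/emrtest.jar")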

I prefer option 2 because it avoids hardcoding the master in the application code.