So I'm trying to run a Spark pipeline on EMR, and I'm creating a step roughly like this:
// Build the Spark job submission request
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar(jarS3Path)
      .withMainClass("com.example.SparkApp")
  )
The problem is that when I run this, I get an exception like:
org.apache.spark.SparkException: A master URL must be set in your configuration
I'm trying to figure out where to specify the master URL, but I can't seem to find it. Do I specify it when setting up the pipeline-run step, or do I need to somehow get the master IP:port into the application and set it in the main function?
Answer 0 (score: 1)
You should specify it in your application when you create the SparkSession instance.
Example for running locally (Scala code):
val sparkSessionBuilder = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .master("local[*]")
  .config("spark.driver.host", "localhost")
You can find more information at jaceklaskowski.gitbooks.io or spark.apache.org.
When you launch on the cluster, you should define the step with command-runner.jar and pass your jar to it:
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("command-runner.jar")
      .withArgs("spark-submit",
        "--deploy-mode", "cluster",
        "--driver-memory", "10G",
        "--class", <your_class_to_run>,
        "s3://path_to_your_jar")
  )
Answer 1 (score: 1)
In your Spark application you can do the following... this is option 1:
val sparkSessionBuilder = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .master("yarn")
If you want to add it to the StepConfig instead... this is option 2:
// Define Spark Application
HadoopJarStepConfig sparkConfig = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit", "--deploy-mode", "cluster", "--master", "yarn",
        "--class", "com.amazonaws.samples.TestQuery",
        "s3://20180205-kh-emr-01/jar/emrtest.jar", "10", "Step Test"); // optional list of arguments

StepConfig customStep = new StepConfig()
    .withHadoopJarStep(sparkConfig)
    .withName("SparkSQL");
I prefer option 2 because the master is not hard-coded in the application code.
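With option 2, the application itself can stay master-agnostic. A minimal sketch of what the app-side code then looks like; this is my own illustration under that assumption, not code from the answer:

import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: spark-submit (via --master yarn in the step args)
    // supplies the master URL, so nothing is hard-coded in the application.
    val spark = SparkSession
      .builder()
      .appName("SparkApp")
      .getOrCreate()

    spark.range(100).count()
    spark.stop()
  }
}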