How do I run a Glue job locally?

Asked: 2019-11-21 08:45:47

Tags: scala amazon-web-services apache-spark aws-glue

I have set up a project as described here. However, this code:

import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

object MyGlueJob {
  private val logger = LoggerFactory.getLogger(getClass)
  def main(sysArgs: Array[String]) {

    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val awsGlueClient = AWSGlueClientBuilder.defaultClient
  }
}

fails with the error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/21 15:40:32 INFO SparkContext: Running Spark version 2.4.3
19/11/21 15:40:33 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
    at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
19/11/21 15:40:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2544)
    at MyGlueJob$.main(MyGlueJob.scala:13)
    at MyGlueJob.main(MyGlueJob.scala)
    ... 5 more

Clearly the master URL has to be set, but how can I set it from the command line or a system variable (that is, without touching the code)?

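I also read that the --master parameter should solve this, but adding it to the program arguments has no effect (this is my IntelliJ IDEA run configuration):

[screenshot: IntelliJ IDEA run configuration]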

The key question: can I run a Glue job locally and then run the same job in AWS without touching the code? Is that possible?

1 answer:

Answer 0 (score: 0)

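You can create the Spark session explicitly and set whatever parameters you need, though I can't say whether this will ultimately work in Glue. Below is the local session I use to test Spark jobs on my machine, even though I eventually run them in Glue. I only test pure Spark code this way.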

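  import org.apache.hadoop.security.UserGroupInformation
  import org.apache.spark.sql.SparkSession

  // Local-only session used for tests; place it inside a test class or object.
  lazy val spark: SparkSession = {
    // Set the Hadoop login user explicitly to "hduser".
    UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
    SparkSession
      .builder()
      .master("local") // run Spark in-process with a single worker thread
      .appName("spark unit test")
      .getOrCreate()
  }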
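
If changing a line or two is acceptable, one possible variation (a minimal sketch, not verified against Glue itself) is to fall back to a local master only when none has been supplied, so the same code can still run in an environment that sets its own master. The object name LocalOrClusterJob, the helper buildConf, and the app name are hypothetical; SparkConf.setIfMissing and SparkContext.getOrCreate(conf) are standard Spark APIs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalOrClusterJob {
  // Hypothetical helper: builds a conf that defaults to a local master.
  def buildConf(): SparkConf =
    new SparkConf() // also loads any spark.* JVM system properties
      .setIfMissing("spark.master", "local[*]") // applies only when no master is set yet
      .setIfMissing("spark.app.name", "glue-local-test") // hypothetical app name

  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(buildConf())
    try {
      // Small sanity check that the context can actually run a job.
      println(sc.parallelize(1 to 10).count())
    } finally {
      sc.stop()
    }
  }
}
```

To leave the code completely untouched, the same property can be passed as a JVM system property instead, for example -Dspark.master=local[*] in the VM options field of the IntelliJ IDEA run configuration rather than in the program arguments. The no-argument SparkContext.getOrCreate() builds a default SparkConf, which reads spark.* system properties. --master is an option of spark-submit, which is probably why adding it to the program arguments had no effect.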
  

"The key question: can I run a Glue job locally and then run the same job in AWS without touching the code? Is that possible?"

You can run any code using a dev endpoint and Zeppelin. See the AWS docs.