I'm trying to run a PySpark script remotely against AWS EMR, following the instructions provided by AWS. However, when I try to submit the script, I get the following exception:
Traceback (most recent call last):
File "/home/aco/src/test_remote_pyspark.py", line 19, in <module>
spark = SparkSession.builder.config(conf=conf).getOrCreate()
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/pyspark/context.py", line 349, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/pyspark/context.py", line 288, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/py4j/java_gateway.py", line 1525, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/aco/.local/share/virtualenvs/prototypes-dS3RdFhP/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:160)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:178)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 20 more
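For reference, line 19 of test_remote_pyspark.py is just the standard session bootstrap. Here is a minimal sketch of the setup (the master and app name below are illustrative, not my exact values; the trace itself only confirms a YARN client-mode submission, via YarnClientSchedulerBackend):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build the configuration for a remote YARN submission (values illustrative)
conf = SparkConf()
conf.setMaster('yarn')                   # EMR's resource manager, per the AWS instructions
conf.setAppName('test_remote_pyspark')

# This is the call that raises the Py4JJavaError shown above
spark = SparkSession.builder.config(conf=conf).getOrCreate()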
I set SPARK_HOME to the directory where I unpacked Spark, and I can see that it does contain some jars:
$ echo $SPARK_HOME
/home/aco/Downloads/spark-2.4.0-bin-hadoop2.7
$ ls $SPARK_HOME/jars | head
activation-1.1.1.jar
aircompressor-0.10.jar
antlr-2.7.7.jar
antlr4-runtime-4.7.jar
antlr-runtime-3.4.jar
aopalliance-1.0.jar
aopalliance-repackaged-2.4.0-b34.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
apache-log4j-extras-1.2.17.jar
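The missing class, com.sun.jersey.api.client.config.ClientConfig, belongs to the Jersey 1.x client (Jersey 2.x moved to the org.glassfish.jersey namespace), so one quick check is whether any Jersey jars ship with this distribution at all:

$ ls $SPARK_HOME/jars | grep -i jersey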
Beyond that, I really don't know how to debug this. Any ideas?