Pyspark - what are the differences in behavior between `spark-submit --jars` and `sc._jsc.addJar('myjar.jar')`

Asked: 2018-02-01 18:05:12

Tags: apache-spark pyspark

So, I have a PySpark program that runs fine with the following command:

spark-submit --jars terajdbc4.jar,tdgssconfig.jar --master local sparkyness.py

And yes, it's running in local mode and executing only on the master node.

However, I want to be able to launch my PySpark script with just:

python sparkyness.py

So, I have added the following lines of code throughout my PySpark script to facilitate that:

import findspark
findspark.init()



sconf.setMaster("local")



sc._jsc.addJar('/absolute/path/to/tdgssconfig.jar')
sc._jsc.addJar('/absolute/path/to/terajdbc4.jar')

This does not seem to be working, though. Every time I try to run the script with python sparkyness.py I get the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.jdbc.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

What is the difference between spark-submit --jars and sc._jsc.addJar('myjar.jar') and what could be causing this issue? Do I need to do more than just sc._jsc.addJar()?

1 Answer:

Answer 0 (score: 1):

Use spark.jars when you build the SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/jar')\
    .getOrCreate()
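For the Teradata jars from the question, a minimal sketch of the same approach might look like the following (only the jar paths and the driver class come from the question; the JDBC URL, table name, and credentials are hypothetical placeholders):

from pyspark.sql import SparkSession

# Both Teradata jars go on spark.jars as a comma-separated list so they
# reach the driver and executor classpaths.
spark = SparkSession.builder.appName('sparkyness')\
    .master('local')\
    .config('spark.jars', '/absolute/path/to/terajdbc4.jar,/absolute/path/to/tdgssconfig.jar')\
    .getOrCreate()

# With the jars on the classpath, the JDBC read can resolve TeraDriver.
df = spark.read.format('jdbc')\
    .option('url', 'jdbc:teradata://<host>')\
    .option('driver', 'com.teradata.jdbc.TeraDriver')\
    .option('dbtable', '<schema.table>')\
    .option('user', '<user>')\
    .option('password', '<password>')\
    .load()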

Related: Add Jar to standalone pyspark

Edit: I don't recommend hijacking _jsc, because I don't believe it handles distributing the jar to the driver and executors or adding it to the classpath.

Example: I created a new SparkSession without the Hadoop AWS jar, then tried to access S3 and got an error (the same error as when adding the jar with sc._jsc.addJar):

  

Py4JJavaError: An error occurred while calling o35.parquet. : java.io.IOException: No FileSystem for scheme: s3

Then I created a session with the jar and got a new, expected error:

  

Py4JJavaError: An error occurred while calling o390.parquet. : java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
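For completeness, a sketch of that S3 experiment might look like this (the jar path and S3 path are placeholders, not from the original answer; the hadoop-aws jar may also need a matching aws-java-sdk jar):

from pyspark.sql import SparkSession

# Without the Hadoop AWS jar on spark.jars, the read below fails with
# "No FileSystem for scheme: s3"; with the jar, the failure moves on to
# the missing-credentials error quoted above.
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/hadoop-aws.jar')\
    .getOrCreate()

df = spark.read.parquet('s3://some-bucket/some/path/')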