So, I have a PySpark program that runs fine with the following command:
spark-submit --jars terajdbc4.jar,tdgssconfig.jar --master local sparkyness.py
And yes, it's running in local mode and just executing on the master node.
However, I want to be able to launch my PySpark script with just:
python sparkyness.py
So, I have added the following lines of code throughout my PySpark script to facilitate that:
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

sconf = SparkConf()
sconf.setMaster("local")
sc = SparkContext(conf=sconf)
sc._jsc.addJar('/absolute/path/to/tdgssconfig.jar')
sc._jsc.addJar('/absolute/path/to/terajdbc4.jar')
This does not seem to be working, though. Every time I try to run the script with python sparkyness.py
I get the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.jdbc.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
What is the difference between spark-submit --jars and sc._jsc.addJar('myjar.jar'), and what could be causing this issue? Do I need to do more than just sc._jsc.addJar()?
Answer 0 (score: 1)
Use spark.jars when building the SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/jar')\
    .getOrCreate()
Related: Add Jar to standalone pyspark
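Applied to the question's Teradata setup, a minimal sketch might look like the following (the jar paths and all JDBC connection details are assumptions to adjust for your environment):

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Pass both Teradata jars, comma-separated, via spark.jars.
spark = SparkSession.builder.appName('sparkyness')\
    .master('local')\
    .config('spark.jars', '/absolute/path/to/terajdbc4.jar,/absolute/path/to/tdgssconfig.jar')\
    .getOrCreate()

# Hypothetical JDBC read; the TeraDriver class is provided by terajdbc4.jar.
df = spark.read.format('jdbc')\
    .option('driver', 'com.teradata.jdbc.TeraDriver')\
    .option('url', 'jdbc:teradata://your.teradata.host/DATABASE=your_db')\
    .option('dbtable', 'your_table')\
    .option('user', 'your_user')\
    .option('password', 'your_password')\
    .load()

With the jars declared on the session via spark.jars, the driver class should be found when the jdbc read runs, which is what spark-submit --jars was doing for you before.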
Edit: I don't recommend hijacking _jsc, because I don't think it handles distributing the jar to the driver and executors or adding it to the classpath.
Example: I created a new SparkSession without the Hadoop AWS jar, then tried to access S3 and got an error (the same error as when adding it with sc._jsc.addJar):
Py4JJavaError: An error occurred while calling o35.parquet. : java.io.IOException: No FileSystem for scheme: s3
Then I created a session with the jar and got a new, expected error:
Py4JJavaError: An error occurred while calling o390.parquet. : java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of an s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
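For reference, a rough sketch of that experiment (the jar path and bucket are assumptions, not from the answer; the failing reads are commented out so the snippet runs as-is):

from pyspark.sql import SparkSession

# Without the Hadoop AWS jar: an S3 read fails with
# "java.io.IOException: No FileSystem for scheme: s3".
spark = SparkSession.builder.appName('no_aws_jar').getOrCreate()
# spark.read.parquet('s3://some-bucket/some/path')  # raises the first error above
spark.stop()

# With the jar declared via spark.jars: the s3 scheme is recognized, and the
# read now fails later, on missing credentials (the "expected" error above).
spark = SparkSession.builder.appName('with_aws_jar')\
    .config('spark.jars', '/absolute/path/to/hadoop-aws.jar')\
    .getOrCreate()
# spark.read.parquet('s3://some-bucket/some/path')  # raises the credentials error

The point of the experiment is that spark.jars actually puts the jar on the classpath of the session, whereas sc._jsc.addJar left the class unresolvable, just like creating the session without the jar at all.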