I am currently writing a Spark Streaming application that needs several dependencies and configuration settings. Right now I have to run my Spark job with the following command:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.6,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.6 --conf spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll --conf spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll mongodbtest.py
It works, but the submit command is quite unwieldy, and I would like to put these dependencies and settings somewhere so that I can simply run
spark-submit mongodbtest.py
I have already tried setting PYSPARK_SUBMIT_ARGS through os.environ in the script, like this:
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll ' \
                                    '--conf spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll ' \
                                    '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6'
But running spark-submit mongodbtest.py with the lines above included in the Python file results in the following error:
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.6.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
and the job fails. Where can I put the configuration and dependency information so that I don't have to include all of it on the command line every time I submit a Spark job? Thanks.
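
What I am imagining is something like putting these options into conf/spark-defaults.conf, roughly along these lines (just a sketch using the property names from the Spark configuration docs; I haven't confirmed that spark.jars.packages pulls in the Kafka libraries the same way --packages does):

spark.jars.packages       org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.6,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.6
spark.mongodb.input.uri   mongodb://127.0.0.1/test.coll
spark.mongodb.output.uri  mongodb://127.0.0.1/test.coll

I'm not sure whether a defaults file like this is the intended mechanism for this, hence the question.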