I am currently writing a Spark Streaming application that needs several dependencies and configuration settings. Right now I have to run my Spark job with the following command:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.6,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.6 --conf spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll --conf spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll mongodbtest.py
It works, but the submit command is quite unwieldy, and I would like to put these dependencies and settings somewhere so that I can simply run
spark-submit mongodbtest.py
I have already tried setting PYSPARK_SUBMIT_ARGS through os.environ in the script, like this:
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll ' \
                                    '--conf spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll ' \
                                    '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6'
But running spark-submit mongodbtest.py with the lines above included in the Python file results in the following error:
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.6 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.6.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
and the job fails. Where can I put the configuration and dependency information so that I don't have to include all of it on the command line every time I submit a Spark job? Thanks.
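
What I am imagining is something like putting these options into conf/spark-defaults.conf, roughly along these lines (just a sketch using the property names from the Spark configuration docs; I haven't confirmed that spark.jars.packages pulls in the Kafka libraries the same way --packages does):

spark.jars.packages       org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.6,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.6
spark.mongodb.input.uri   mongodb://127.0.0.1/test.coll
spark.mongodb.output.uri  mongodb://127.0.0.1/test.coll

I'm not sure whether a defaults file like this is the intended mechanism for this, hence the question.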