Library not found when running a PySpark job with an external jar file

Time: 2018-07-23 13:21:49

Tags: python apache-spark pyspark mqtt

I have a PySpark job, InitiatorSpark.py, with the following code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Test") \
    .getOrCreate()

lines = (spark
             .readStream
             .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
             .option("topic","my_topic")
             .load("tcp://{}".format("127.0.0.1:1883")))

I run it as follows:

spark-submit --jars lib/spark-sql-streaming-mqtt_2.11-2.2.1.jar InitiatorSpark.py

Spark starts up, but then fails at the line `.load("tcp://{}".format("127.0.0.1:1883")))` with the following message:

Caused by: java.lang.ClassNotFoundException: org.eclipse.paho.client.mqttv3.MqttClientPersistence

Although I supplied what I believe is the correct JAR file, the class MqttClientPersistence cannot be found. Inside lib there are two files:

spark-streaming-mqtt_2.11-2.2.1-sources.jar 
spark-streaming-mqtt_2.11-2.2.1.jar
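When debugging a `ClassNotFoundException` like this, one quick check is whether the missing class is actually inside any of the jars you pass to `--jars`. Since a jar is just a zip archive, Python's standard `zipfile` module can list its entries (the jar path in the commented example is illustrative):

```python
import zipfile

def jar_contains_class(jar_path, class_name):
    """Return True if the fully-qualified class is an entry in the jar.
    Jars are zip archives; a class com.foo.Bar is stored as com/foo/Bar.class."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Illustrative usage with the jar from the question:
# jar_contains_class("lib/spark-sql-streaming-mqtt_2.11-2.2.1.jar",
#                    "org.eclipse.paho.client.mqttv3.MqttClientPersistence")
```

If this returns False for every jar you supply, the class comes from a dependency that is not on the classpath, which is exactly the situation described above.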

What is wrong with my setup?

1 Answer:

Answer 0: (score: 0)

I was able to run this code by adding three JAR files to the spark-submit command. The MQTT connector jar alone is not enough: the `org.eclipse.paho.client.mqttv3` classes live in a separate Paho client jar, which must also be put on the classpath:

spark-submit --jars lib/spark-streaming-mqtt_2.11-2.2.1.jar,lib/spark-sql-streaming-mqtt_2.11-2.2.1.jar,lib/org.eclipse.paho.client.mqttv3-1.2.0.jar InitiatorSpark.py
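As an alternative to listing each jar by hand, spark-submit can resolve a connector and its transitive dependencies (including the Paho client) from Maven via `--packages`. A sketch, assuming the Bahir artifact coordinates match the version used above and that the machine has network access to Maven Central:

```shell
spark-submit \
  --packages org.apache.bahir:spark-sql-streaming-mqtt_2.11:2.2.1 \
  InitiatorSpark.py
```

This avoids the class of error in the question, where a required transitive jar is missing from a manually assembled `--jars` list.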