I am trying to run the Python Spark Streaming job given in the examples directory:
https://spark.apache.org/docs/2.1.1/streaming-programming-guide.html
"""
Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
Usage: kafka_wordcount.py <zk> <topic>
To run this on your local machine, you need to setup Kafka and create a producer first, see
http://kafka.apache.org/documentation.html#quickstart
and then run the example
`$ bin/spark-submit --jars \
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar \
examples/src/main/python/streaming/kafka_wordcount.py \
localhost:2181 test`
"""
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    # counts.pprint()
    ssc.start()
    ssc.awaitTermination()
I downloaded spark-streaming-kafka-0-8_2.11-2.1.0.jar to my local directory and ran my spark-submit command:
bin/spark-submit --jars ../external/spark-streaming-kafka*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
I get the following error:
Exception in thread "Thread-3" java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
Answer 0 (score: 0)
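You need to use the spark-streaming-kafka-assembly jar, not the spark-streaming-kafka jar. The assembly jar bundles all of the dependencies (including the Kafka client), which is where the missing kafka/common/TopicAndPartition class comes from.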
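
One way to confirm this before resubmitting is to check whether the jar you pass to --jars actually contains the class named in the NoClassDefFoundError. The following is only a minimal sketch, not part of Spark: jar_contains_class is a hypothetical helper name, and it relies solely on the fact that a jar is a zip archive in which class files are stored under their package path.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path holds a .class entry for class_name.

    Hypothetical helper for illustration; class_name is a dotted name such as
    "kafka.common.TopicAndPartition".
    """
    # Inside a jar, classes are stored with '/' separators and a .class suffix.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, calling jar_contains_class("spark-streaming-kafka-0-8_2.11-2.1.0.jar", "kafka.common.TopicAndPartition") on the jar from the question would be expected to return False if the answer's diagnosis is right, while the same call on an assembly jar should return True.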