Kafka Structured Streaming java.lang.NoClassDefFoundError

Time: 2017-09-12 20:10:02

Tags: apache-spark pyspark apache-kafka spark-structured-streaming

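Previously I was able to run Kafka Structured Streaming programs. Suddenly, all of my Structured Streaming Python programs fail with an error. I took the basic Kafka Structured Streaming example from the Spark website, and it fails with the same error.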

  

py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer
        at org.apache.spark.sql.kafka010.KafkaSourceProvider$.(KafkaSourceProvider.scala:376)
        at org.apache.spark.sql.kafka010.KafkaSourceProvider$.(KafkaSourceProvider.scala)

This is the spark-submit command I am using:

  

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 C:\Users\ranjith.gangam\PycharmProjects\sparktest\Structured_streaming.py
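The class named in the error, org.apache.kafka.common.serialization.ByteArrayDeserializer, ships in the kafka-clients jar, which --packages is expected to pull in as a transitive dependency. One possible check is to scan the downloaded jars for that class. This is only a sketch: it assumes the default Ivy directory (~/.ivy2/jars) is in use, and the helper name jars_containing is made up for illustration.

```python
import zipfile
from pathlib import Path

# Class named in the NoClassDefFoundError, written as a path inside a jar.
MISSING_CLASS = "org/apache/kafka/common/serialization/ByteArrayDeserializer.class"


def jars_containing(jar_dir, class_entry=MISSING_CLASS):
    """Return the names of jars directly under jar_dir that contain class_entry."""
    found = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if class_entry in archive.namelist():
                    found.append(jar.name)
        except zipfile.BadZipFile:
            # A truncated or corrupt download is itself a useful finding.
            print(f"corrupt jar: {jar.name}")
    return found


if __name__ == "__main__":
    # Assumption: spark-submit --packages used the default Ivy directory.
    ivy_jars = Path.home() / ".ivy2" / "jars"
    print(jars_containing(ivy_jars))
```

If the list comes back empty, or a jar is reported as corrupt, the kafka-clients dependency was probably not resolved correctly, and clearing the cached artifacts before re-running spark-submit may be worth trying.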

This is the code I took from the Spark GitHub repository:

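import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Command-line arguments, as in the upstream example:
# <bootstrap-servers> <subscribe-type> <topics>
bootstrapServers = sys.argv[1]
subscribeType = sys.argv[2]  # 'assign', 'subscribe' or 'subscribePattern'
topics = sys.argv[3]

spark = SparkSession\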
      .builder\
      .appName("StructuredKafkaWordCount")\
      .getOrCreate()

# Create DataSet representing the stream of input lines from kafka
lines = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", bootstrapServers)\
    .option(subscribeType, topics)\
    .load()\
    .selectExpr("CAST(value AS STRING)")

words = lines.select(
    # explode turns each item in an array into a separate row
    explode(
        split(lines.value, ' ')
    ).alias('word')
)

# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .start()

query.awaitTermination()
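
The snippet passes subscribeType straight into .option(), so a typo there only surfaces once the query starts. A small sketch of validating the options up front, without needing a Spark session, might look like the following. The helper name kafka_source_options is hypothetical; the three accepted subscribe types are the ones the Kafka source documents.

```python
# Subscribe types accepted by the Structured Streaming Kafka source.
VALID_SUBSCRIBE_TYPES = ("assign", "subscribe", "subscribePattern")


def kafka_source_options(bootstrap_servers, subscribe_type, topics):
    """Build the option dict for the Kafka source, rejecting unknown subscribe types."""
    if subscribe_type not in VALID_SUBSCRIBE_TYPES:
        raise ValueError(
            f"subscribe_type must be one of {VALID_SUBSCRIBE_TYPES}, got {subscribe_type!r}"
        )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        subscribe_type: topics,
    }


if __name__ == "__main__":
    # Hypothetical broker address and topic name, for illustration only.
    opts = kafka_source_options("localhost:9092", "subscribe", "topic1")
    print(opts)
    # With a SparkSession available, the dict could be applied as:
    # spark.readStream.format("kafka").options(**opts).load()
```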

1 answer:

Answer 0 (score: -1)

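You are on the right track, but unfortunately PySpark does not support Kafka 0.10 yet, as you can see in SPARK-16534.

So far, the only Kafka version PySpark supports is 0.8. You can therefore either migrate to Kafka 0.8 or rewrite your code in Scala.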
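
If you do switch connectors, the --packages coordinate has to change with it. As a rough sketch, assuming Spark 2.2.0 built for Scala 2.11 as in the question, the command could be assembled like this. The helper name and the dictionary keys are made up for illustration; the two artifact names are the ones published for Spark 2.2.x.

```python
# Kafka connector artifacts published for Spark 2.2.x.
# The dictionary keys are illustrative labels, not Spark terminology.
CONNECTORS = {
    "structured-0-10": "spark-sql-kafka-0-10",
    "dstream-0-8": "spark-streaming-kafka-0-8",
}


def spark_submit_command(script, connector, scala_version="2.11", spark_version="2.2.0"):
    """Return the spark-submit argument list for the chosen Kafka connector."""
    artifact = CONNECTORS[connector]
    # Coordinate format: group:artifact_<scala version>:<spark version>
    coordinate = f"org.apache.spark:{artifact}_{scala_version}:{spark_version}"
    return ["spark-submit", "--packages", coordinate, script]


if __name__ == "__main__":
    print(" ".join(spark_submit_command("Structured_streaming.py", "dstream-0-8")))
```

The Scala and Spark versions in the coordinate need to match the Spark installation that runs the job; a mismatch there is another possible source of class-loading errors.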