pyspark-kafka integration: missing lib

Date: 2018-12-07 15:51:51

Tags: python apache-spark pyspark apache-kafka

To integrate Kafka into my project, I am following the instructions from Databricks at this address:

Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)

Code:

# coding: utf-8
import sys
import os,time
sys.path.append("/usr/local/lib/python2.7/dist-packages")
from pyspark.sql import SparkSession,Row
from pyspark import SparkContext,SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.types import *
import pyspark.sql.functions
import json

spark = SparkSession.builder.appName("Kafka-test").getOrCreate()
spark.sparkContext.setLogLevel('WARN')


trainingSchema = StructType([
  StructField("code",StringType(),True),
  StructField("ean",StringType(),True),
  StructField("name",StringType(),True),
  StructField("description",StringType(),True),
  StructField("category",StringType(),True),
  StructField("attributes",StringType(),True)
])
# Empty DataFrame with the training schema
trainingDF = spark.createDataFrame(spark.sparkContext.emptyRDD(), trainingSchema)

broker, topic = ['kafka.partner.stg.some.domain:9092', 'hybris.products']

# Read the Kafka topic as a Structured Streaming source
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .load()
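(The trainingSchema above is not wired in yet; the intent, once the stream loads, is to parse the binary value column against it, roughly like the sketch below. The names parsed and json_str are just placeholders:)

# Sketch: decode the Kafka message bytes to a string, then apply the schema
parsed = df.selectExpr("CAST(value AS STRING) AS json_str") \
    .select(pyspark.sql.functions.from_json("json_str", trainingSchema).alias("data")) \
    .select("data.*")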

My Hadoop version is 2.6 and my Spark version is 2.3.0.

The spark-submit command line is:

spark-submit --jars jars/spark-sql-kafka-0-10_2.11-2.3.0.jar kafka-test-002.py

Error message:

Py4JJavaError: An error occurred while calling o48.load.
: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:413)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:360)
  at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:64)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:231)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:94)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:94)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:170)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

As you can check on the site I mentioned above, that jar file is exactly the one referenced there. So I have no idea why this is happening. Maybe there is another module that wasn't mentioned? I'm really lost here.

1 answer:

Answer 0 (score: 1)

The JAR you mention does not include all the dependencies of the Kafka client. Instead, you should use --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 (as described in the "Deploying" section of the docs: https://spark.apache.org/docs/2.3.0/structured-streaming-kafka-integration.html#deploying).
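For example, the submit command from the question would become (a minimal sketch, assuming the same script name):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 kafka-test-002.py

With --packages, spark-submit resolves spark-sql-kafka-0-10 together with its transitive dependencies (including kafka-clients, which provides the missing ByteArrayDeserializer) from the Maven repositories, rather than putting only that single JAR on the classpath.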