I am new to Scala/Spark Streaming and to StackOverflow, so please excuse my formatting. I built a Scala app that reads log files from a Kafka stream. It runs fine in the IDE, but darned if I can get it to run using spark-submit. It always fails with:
ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
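For reference, that class lives in the kafka-clients jar, so the error means it is missing from the runtime classpath. A quick way to check whether it made it into the uber-jar (using the jar name from below):

jar tf BroLogSpark.jar | grep ByteArrayDeserializer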
The line referenced in the exception is the load() call in this snippet:
val records = spark
  .readStream
  .format("kafka")                                 // <-- use KafkaSource
  .option("subscribe", kafkaTopic)
  .option("kafka.bootstrap.servers", kafkaBroker)  // 192.168.4.86:9092
  .load()
  .selectExpr("CAST(value AS STRING) AS temp")
  .withColumn("record", deSerUDF($"temp"))
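For context, the snippet assumes surrounding setup roughly like the sketch below; the SparkSession wiring and the body of deSerUDF are not shown above, so treat those parts as stand-ins rather than the actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("BroLogSpark")
  .getOrCreate()
import spark.implicits._   // enables the $"temp" column syntax

val kafkaBroker = args(0)  // e.g. 192.168.4.86:9092
val kafkaTopic  = args(1)  // e.g. BroFile

// Hypothetical stand-in for deSerUDF; the real parsing logic is not shown.
val deSerUDF = udf((raw: String) => raw.trim)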
The relevant portion of my pom.xml:
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.1</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.github.scala-incubator.io</groupId>
        <artifactId>scala-io-file_2.11</artifactId>
        <version>0.4.3-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.0.0</version>
        <!-- <version>2.0.0</version> -->
    </dependency>
</dependencies>
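One check that helps with a dependency mix like this is Maven's dependency tree, filtered to the Kafka group, to see which kafka-clients version the Spark Kafka artifacts actually resolve to (spark-sql-kafka-0-10 declares its own kafka-clients dependency, so an explicitly pinned 0.10.0.0 can conflict with it):

mvn dependency:tree -Dincludes=org.apache.kafka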
Note: I don't think it is relevant, but I have to use zip -d BroLogSpark.jar "META-INF/*.SF" and zip -d BroLogSpark.jar "META-INF/*.DSA" to get past complaints about the manifest signatures.
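As an aside, the build-time equivalent of those zip -d calls is to strip signature files while assembling the uber-jar. A minimal sketch with the maven-shade-plugin (an assumption; my plugin section is not shown above):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.1.0</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
                <filters>
                    <filter>
                        <!-- drop signature files so the merged jar is not rejected -->
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>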
My jar file does not contain any of org.apache.kafka. I have seen several posts strongly suggesting a version mismatch, and I have tried countless permutations of pom.xml and spark-submit. After each change I confirm that the app still runs in the IDE, then try spark-submit again on the same system, as the same user. Below is my most recent attempt, where BroLogSpark.jar is in the current directory and "192.168.4.86:9092 BroFile" are the input arguments.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.1,org.apache.kafka:kafka-clients:0.10.0.0 BroLogSpark.jar 192.168.4.86:9092 BroFile
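One variant worth noting: format("kafka") in the snippet above resolves to the Structured Streaming source from spark-sql-kafka-0-10, not the DStream package spark-streaming-kafka-0-10 passed above. A sketch of the command requesting that package instead (which pulls a matching kafka-clients in transitively):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1 BroLogSpark.jar 192.168.4.86:9092 BroFile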
Answer 0 (score: 0)
Also add the following dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.0.0</version>
</dependency>