I am using HDP-2.6.3.0 with the Spark2 package 2.2.0.
I am trying to write a Kafka consumer using the Structured Streaming API, but I am getting the following error after submitting the job to the cluster:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:553)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:198)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:90)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at com.example.KafkaConsumer.main(KafkaConsumer.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$anonfun$22$anonfun$apply$14.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$anonfun$22$anonfun$apply$14.apply(DataSource.scala:537)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$anonfun$22.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$anonfun$22.apply(DataSource.scala:537)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:537)
... 17 more
The spark-submit command:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode client \
--class com.example.KafkaConsumer \
--executor-cores 2 \
--executor-memory 512m \
--driver-memory 512m \
sample-kafka-consumer-0.0.1-SNAPSHOT.jar
My Java code:
package com.example;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class KafkaConsumer {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("kafkaConsumerApp")
.getOrCreate();
Dataset<Row> ds = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "dog.mercadoanalitico.com.br:6667")
.option("subscribe", "my-topic")
.load();
}
}
The pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>sample-kafka-consumer</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<dependencies>
<!-- spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.1.0</version>
</dependency>
</dependencies>
<repositories>
<repository>
<id>local-maven-repo</id>
<url>file:///${project.basedir}/local-maven-repo</url>
</repository>
</repositories>
<build>
<!-- Include resources folder in the .jar -->
<resources>
<resource>
<directory>${basedir}/src/main/resources</directory>
</resource>
</resources>
<plugins>
<!-- Plugin to compile the source. -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- Plugin to include all the dependencies in the .jar and set the main class. -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<!-- This filter is to workaround the problem caused by included signed jars.
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
-->
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.KafkaConsumer</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
[UPDATE] UBER-JAR
Below is the pom.xml configuration used to generate the uber-jar:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<!-- This filter is to workaround the problem caused by included signed jars.
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
-->
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.KafkaConsumer</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
Answer 0 (score: 15)
The kafka data source is an external module and is not available to Spark applications by default.
You have to define it as a dependency in your pom.xml (as you did), but that is only the very first step towards using it in your Spark application.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
With that dependency in place, you have to decide whether to create a so-called uber-jar that bundles all the dependencies together (which produces a fairly large jar file and makes submission take longer), or to use the --packages option (or the less flexible --jars) to add the dependency at spark-submit time.
(There are other options too, such as storing the required jars on Hadoop HDFS or using Hadoop-distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple.)
I recommend using --packages first, and only once that works should you consider the other options.
Use spark-submit --packages to include the spark-sql-kafka-0-10 module, as follows:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Add the other command-line options as needed.
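For example, applied to the submit command from the question (keeping the same resource settings), the full invocation might look like the sketch below; note that the package coordinates must match your Spark and Scala versions:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode client \
--class com.example.KafkaConsumer \
--executor-cores 2 \
--executor-memory 512m \
--driver-memory 512m \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
sample-kafka-consumer-0.0.1-SNAPSHOT.jar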
Including all the dependencies in a so-called uber-jar may not always work because of how the META-INF directory is handled.
For the kafka data source (and data sources in general) to work, you have to make sure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged (not replaced, taken first, or whatever other strategy you use).
The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, which registers org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
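One way to check whether the merge worked is to run the same service lookup that Spark performs internally. Below is a minimal diagnostic sketch (the ListDataSources class is hypothetical, not part of the question's code); compile it into the project, run it with the uber-jar on the classpath, and "kafka" should appear in the output if the service files were merged correctly:

package com.example;

import java.util.ServiceLoader;
import org.apache.spark.sql.sources.DataSourceRegister;

public class ListDataSources {
    public static void main(String[] args) {
        // ServiceLoader reads every META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
        // entry on the classpath, which is the same lookup DataSource.lookupDataSource relies on.
        for (DataSourceRegister provider : ServiceLoader.load(DataSourceRegister.class)) {
            System.out.println(provider.shortName() + " -> " + provider.getClass().getName());
        }
    }
}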
Answer 1 (score: 1)
For the uber-jar, adding the ServicesResourceTransformer to the shade plugin worked for me:
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
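After rebuilding, you can verify that the merged service file still lists the Kafka provider; a quick check (the jar path below is inferred from the question's pom.xml, so adjust it to your build output) is:

unzip -p target/sample-kafka-consumer-0.0.1-SNAPSHOT.jar \
    META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

The output should include the line org.apache.spark.sql.kafka010.KafkaSourceProvider.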
Answer 2 (score: 1)
The top answer is correct, and this (an sbt-assembly merge strategy) solved the problem for me:
assemblyMergeStrategy in assembly := {
case "reference.conf" => MergeStrategy.concat
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
case PathList("META-INF", xs@_*) => MergeStrategy.discard
case _ => MergeStrategy.first
}
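For context, this snippet belongs in build.sbt and assumes the sbt-assembly plugin is enabled, e.g. with a line like the following in project/plugins.sbt (the version placeholder is yours to fill in):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "<version>")

The two MergeStrategy.concat cases are the important part: reference.conf and the DataSourceRegister service file must be concatenated rather than discarded along with the rest of META-INF.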
Answer 3 (score: 0)
I had a similar issue; the problem started appearing when we upgraded the Cloudera Spark version from 2.2 to 2.3.
The problem was that my uber jar's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister was being overwritten by the corresponding file from other jars, so the Kafka entry could not be found in the DataSourceRegister file.
Solution: modifying the pom.xml as follows helped me.
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>
META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
</resource>
</transformer>
</transformers>
</configuration>
Answer 4 (score: 0)
My solution was different: I specified the spark-sql-kafka package directly in the submit-job command:
.\bin\spark-submit --master local --class "org.myspark.KafkaStream" --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 <path_to_jar>
Related: http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
Answer 5 (score: 0)
I ran into the same error, and it took me a couple of days to figure out. When you copy a dependency from the Maven repository, in particular spark-sql-kafka, it contains the line:
<scope>provided</scope>
The solution is to remove this line so that the dependency runs in the default compile scope. The same applies if you use sbt. It is probably worth removing it from other dependencies too, if present.
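For example, the spark-sql-kafka dependency from the question's pom.xml should then read as below, with no scope element so that it defaults to compile scope:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>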
Answer 6 (score: 0)
I faced the same problem, but with Gradle and shadowJar. It worked after adding:
shadowJar {
mergeServiceFiles()
}
assemble.dependsOn shadowJar