为什么使用Kafka的Spark Streaming应用程序失败了" ClassNotFoundException:org.apache.spark.streaming.kafka.KafkaRDDPartition"?

时间:2016-12-11 06:33:51

标签: scala apache-spark apache-kafka spark-streaming

我将Spark Streaming与Apache Kafka一起使用。

val directKafkaStream = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder ](
  ssc, kafkaParams, topics)
val events = directKafkaStream.flatMap(x=>{
  val data = JSONObject.fromObject(x._2)
  Some(data)

})
val dbIndex = 1
val clickHashKey = "app::users::click"

val userClicks = events.map(x=>(x.getString("userid"),x.getInt("click_count"))).reduceByKey(_+_)
userClicks.foreachRDD(partitionOfRecords=>partitionOfRecords.foreach(pair=>{
  val userid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, userid, clickCount)
  RedisClient.pool.returnResource(jedis)

}))
ssc.start()
ssc.awaitTermination()

失败并出现以下异常:

16/12/11 14:17:20 INFO DAGScheduler: ShuffleMapStage 146 (map at UserClickCountAnalysis.scala:75) failed in 0.068 s
16/12/11 14:17:20 INFO DAGScheduler: Job 73 failed: foreachRDD at UserClickCountAnalysis.scala:76, took 0.073045 s
16/12/11 14:17:20 ERROR JobScheduler: Error running job streaming job 1481437040000 ms.0

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 146.0 failed 4 times, most recent failure: Lost task 0.3 in stage 146.0 (TID 295, 10.211.55.12): java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaRDDPartition
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/12/11 14:17:25 INFO JobScheduler: Added jobs for time 1481437045000 ms
16/12/11 14:17:25 INFO JobScheduler: Starting job streaming job 1481437045000 ms.0 from job set of time 1481437045000 ms
16/12/11 14:17:25 INFO SparkContext: Starting job: foreachRDD at UserClickCountAnalysis.scala:76
16/12/11 14:17:25 INFO DAGScheduler: Registering RDD 298 (map at UserClickCountAnalysis.scala:75)
16/12/11 14:17:25 INFO DAGScheduler: Got job 74 (foreachRDD at UserClickCountAnalysis.scala:76) with 2 output partitions (allowLocal=false)
16/12/11 14:17:25 INFO DAGScheduler: Final stage: ResultStage 149(foreachRDD at UserClickCountAnalysis.scala:76)
16/12/11 14:17:25 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 148)
16/12/11 14:17:25 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 148)
16/12/11 14:17:25 INFO DAGScheduler: Submitting ShuffleMapStage 148 (MapPartitionsRDD[298] at map at UserClickCountAnalysis.scala:75), which has no missing parents
16/12/11 14:17:25 INFO MemoryStore: ensureFreeSpace(3880) called with curMem=42510, maxMem=2061647216
16/12/11 14:17:25 INFO MemoryStore: Block broadcast_74 stored as values in memory (estimated size 3.8 KB, free 1966.1 MB)
16/12/11 14:17:25 INFO MemoryStore: ensureFreeSpace(2194) called with curMem=46390, maxMem=2061647216
16/12/11 14:17:25 INFO MemoryStore: Block broadcast_74_piece0 stored as bytes in memory (estimated size 2.1 KB, free 1966.1 MB)
16/12/11 14:17:25 INFO BlockManagerInfo: Added broadcast_74_piece0 in memory on 192.168.1.103:56006 (size: 2.1 KB, free: 1966.1 MB)
16/12/11 14:17:25 INFO SparkContext: Created broadcast 74 from broadcast at DAGScheduler.scala:874
16/12/11 14:17:25 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 148 (MapPartitionsRDD[298] at map at UserClickCountAnalysis.scala:75)
16/12/11 14:17:25 INFO TaskSchedulerImpl: Adding task set 148.0 with 1 tasks
16/12/11 14:17:25 INFO TaskSetManager: Starting task 0.0 in stage 148.0 (TID 296, 10.211.55.12, ANY, 1271 bytes)
16/12/11 14:17:25 WARN TaskSetManager: Lost task 0.0 in stage 148.0 (TID 296, 10.211.55.12): java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaRDDPartition
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

我的pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>phnasis</groupId>
<artifactId>sparkstreamingUserClick</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
    <!--<dependency>-->
        <!--<groupId>org.apache.spark</groupId>-->
        <!--<artifactId>spark-core_2.10</artifactId>-->
        <!--<version>1.4.0</version>-->
    <!--</dependency>-->

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_2.10</artifactId>
        <version>1.4.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.4.0</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.10 -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <version>0.8.2.1</version>
    </dependency>


    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
        <type>jar</type>
        <scope>compile</scope>
    </dependency>



</dependencies>
<build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/test/java</testSourceDirectory>
    <plugins>
        <!--
                     Bind the maven-assembly-plugin to the package phase
          this will create a jar file without the storm dependencies
          suitable for deployment to a cluster.
         -->
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass></mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2 个答案:

答案 0 :(得分:1)

鉴于您的pom.xml具有以下内容:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.4.0</version>
</dependency>

猜测问题在于您提交Spark Streaming应用程序以供执行。

您必须使用以下两种可能方式之一包含对Spark环境的类路径的依赖性(这在很大程度上取决于您使用的Spark的版本):

  1. spark-submit --packages是一个以逗号分隔的jars maven坐标列表,包含在驱动程序和执行程序类路径中,例如

    ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.0.2
    
  2. (不推荐)在你的jar中组装Spark依赖项,最终成为 uberjar 与此和其他依赖项(除非你用provided排除它们)。

  3. 建议的方法是使用选项1,但这需要最新的Spark版本(支持--packages),并且由于Spark版本的更改,也会将不同的spark-streaming-kafka模块拆分为{{ 1}}和0.8

答案 1 :(得分:0)

Spark流媒体需要为Kafka提供单独的工件。在你的情况下

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.4.0</version>
</dependency>

PROTIP:只要有机会,你应该考虑将Spark版本更新到最新的 2.0.2