I am trying out a use case with Scala, Kafka, and Spark. I have built a consumer and a producer using the kafka library, and now I am building the data processor that counts words with Spark. Here is my build.sbt:
name := """scala-akka-stream-kafka"""
version := "1.0"
// scalaVersion := "2.12.4"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.kafka" %% "kafka" % "0.10.2.0",
"org.apache.kafka" % "kafka-streams" % "0.10.2.0",
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-streaming" % "2.2.0",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0")
dependencyOverrides ++= Seq(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5",
"com.fasterxml.jackson.core" % "jackson-module-scala" % "2.6.5")
resolvers ++= Seq(
"Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
)
resolvers += Resolver.sonatypeRepo("releases")
My word-count data processor hits some errors at the line val wordMap = words.map(word => (word, 1)):
package com.spark.streams

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}

import scala.collection.mutable

object WordCountSparkStream extends App {

  // Kafka consumer configuration: local broker, plain String keys and values
  val kafkaParam = new mutable.HashMap[String, String]()
  kafkaParam.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  kafkaParam.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaParam.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaParam.put(ConsumerConfig.GROUP_ID_CONFIG, "group1")
  kafkaParam.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
  kafkaParam.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")

  val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountSparkStream")

  // Read messages in micro-batches of 5 seconds
  val sparkStreamingContext = new StreamingContext(conf, Durations.seconds(5))

  // Configure Spark to listen to the topic streams-plaintext-input
  val topicList = List("streams-plaintext-input")

  // Read the value of each message from Kafka
  val messageStream = KafkaUtils.createDirectStream(sparkStreamingContext,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicList, kafkaParam))

  // The records are already typed as String, so no cast is needed
  val lines = messageStream.map(consumerRecord => consumerRecord.value())

  // Break every message into words and return the list of words
  val words = lines.flatMap(_.split(" "))

  // Map every word to a tuple (word, 1)
  val wordMap = words.map(word => (word, 1))

  // Count the occurrences of each word
  val wordCount = wordMap.reduceByKey((first, second) => first + second)

  // Print the word counts
  wordCount.print()

  sparkStreamingContext.start()
  sparkStreamingContext.awaitTermination()

  // "streams-wordcount-output"
}
But this is not a compilation error, and there is not even a library conflict. It says it cannot deserialize, yet I am using the String deserializer, which is exactly what my producer writes.
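For reference, a minimal producer matching those consumer settings would look roughly like this (a sketch only — the class name PlaintextProducer and the sample message are made up for illustration; the broker address, serializers, and topic mirror the consumer code above):

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object PlaintextProducer extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

  // Plain String keys and values, matching the StringDeserializer on the consumer side
  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord[String, String]("streams-plaintext-input", "hello spark streaming"))
  producer.flush()
  producer.close()
}

Here is the runtime log: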
17/12/12 17:02:50 INFO DAGScheduler: Submitting 8 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountSparkStream.scala:37) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7))
17/12/12 17:02:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
17/12/12 17:02:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/12/12 17:02:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/12/12 17:02:50 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ClassNotFoundException: scala.None$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1863)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2206)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: -1):
Try this in your build.sbt:

fork := true

It works for me, though I am not sure exactly why.
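For what it is worth, this looks like a classloader problem rather than a Kafka deserialization problem: by default sbt runs the application inside its own JVM with a layered classloader, and Spark's Java serializer can then fail to resolve even core Scala classes such as scala.None$ while executor threads deserialize tasks. fork := true makes sbt launch the program in a fresh JVM with a plain classpath, which avoids this. A minimal sketch of the relevant build.sbt lines (the outputStrategy setting is optional and only an assumption that you want the program's output in the sbt console):

// Run the application in a separate JVM instead of inside sbt's own process
fork := true

// Optional: forward the forked JVM's stdout/stderr to the sbt console
outputStrategy := Some(StdoutOutput)

With that in place, sbt run launches WordCountSparkStream in its own process, and the ClassNotFoundException should no longer appear.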