Spark Streaming works in local mode, but gets "stage failure" and "Could not initialize class" in client/cluster mode

Time: 2018-11-09 18:22:00

Tags: apache-spark spark-streaming

I have a Spark + Kafka streaming application that runs fine in local mode, but when I try to launch it on YARN in client or cluster mode I get errors like the ones below.

The first error I see regularly is:

WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 9, ip-xxx-24-129-36.ec2.internal, executor 2): java.lang.NoClassDefFoundError: Could not initialize class TestStreaming$
        at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:60)
        at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:59)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The next error is

ERROR JobScheduler: Error running job streaming job 1541786030000 ms.0

followed by

java.lang.NoClassDefFoundError: Could not initialize class

Spark version 2.1.0, Scala 2.11, Kafka version 0.10
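
For context (the actual build file is not shown here), an sbt setup for Spark 2.1.0 / Scala 2.11 with the Kafka 0.10 integration would typically look something like the sketch below; the Typesafe Config dependency is an assumption based on the config.getString calls further down, not something stated in the question.

// build.sbt -- hypothetical sketch matching the versions above
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"                  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming"            % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
  // assumed: the config.getString(...) calls in the code look like Typesafe Config
  "com.typesafe"     %  "config"                     % "1.3.1"
)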

At startup, part of the code loads a configuration file in main. I pass this configuration file with -conf after the jar when running it (see below). I'm not entirely sure, but do I also have to pass this configuration to the executors?
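
UtilFunctions itself is not shown here; judging from the config.getString(...) calls in the code below, the loader presumably does something like the following sketch with Typesafe Config. Only the names loadConfig, loadLogger and config come from the code; the argument parsing and everything else are guesses.

package Util

import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

object UtilFunctions {

  // populated once per JVM by loadConfig(); on YARN the driver and each executor
  // initialize this object independently
  var config: Config = _

  // reads the path that follows "-conf" in the program arguments and parses it
  def loadConfig(args: Array[String]): Unit = {
    val path = args.sliding(2).collectFirst { case Array("-conf", p) => p }
      .getOrElse(sys.error("missing -conf <file>"))
    config = ConfigFactory.parseFile(new File(path)).resolve()
  }

  // placeholder for the logger setup referenced in main(); details unknown
  def loadLogger(): Unit = ()
}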

I start the streaming application with the commands below. One shows local mode, the other client mode.

runJar=myProgram.jar
loggerPath=/path/to/log4j.properties
mainClass=TestStreaming
logger=-DPHDTKafkaConsumer.app.log4j=$loggerPath
confFile=application.conf

-----------local mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit --driver-java-options "$logger" --conf "spark.executor.extraJavaOptions=$logger" --class $mainClass --master local[4] $runJar -conf $confFile &

-----------client mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit --master yarn --conf "spark.executor.extraJavaOptions=$logger" --conf "spark.driver.extraJavaOptions=$logger" --class $mainClass $runJar -conf $confFile &
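
For reference, spark-submit also has a --files option that places a listed file in the working directory of each executor, which may be relevant to the question above about getting application.conf to the executors. A client-mode variant using it could look like:

SPARK_KAFKA_VERSION=0.10 nohup spark2-submit --master yarn --files $confFile --conf "spark.executor.extraJavaOptions=$logger" --conf "spark.driver.extraJavaOptions=$logger" --class $mainClass $runJar -conf $confFile &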

Here is my code below. I have been fighting with this for over a week.

import Util.UtilFunctions
import UtilFunctions.config
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.log4j.Logger


object TestStreaming extends Serializable {

  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]) {
    logger.info("Starting app")

    UtilFunctions.loadConfig(args)
    UtilFunctions.loadLogger()

    val props: Map[String, String] = setKafkaProperties()

    val topic = Set(config.getString("config.TOPIC_NAME"))

    val conf = new SparkConf()
      .setAppName(config.getString("config.SPARK_APP_NAME"))
      .set("spark.streaming.backpressure.enabled", "true")

    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    ssc.sparkContext.setLogLevel("INFO")
    ssc.checkpoint(config.getString("config.SPARK_CHECKPOINT_NAME"))

    val kafkaStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topic, props))
    val distRecordsStream = kafkaStream.map(record => (record.key(), record.value()))
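    // note: window() returns a new DStream; the result is not assigned here, so the
    // foreachRDD below operates on the original (un-windowed) distRecordsStream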
    distRecordsStream.window(Seconds(10), Seconds(10))
    distRecordsStream.foreachRDD(rdd => {
      if(!rdd.isEmpty()) {
        rdd.foreach(record => {
          println(record._2) //value from kafka
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }

  def setKafkaProperties(): Map[String, String] = {

    val deserializer = "org.apache.kafka.common.serialization.StringDeserializer"
    val zookeeper = config.getString("config.ZOOKEEPER")
    val offsetReset = config.getString("config.OFFSET_RESET")
    val brokers = config.getString("config.BROKERS")
    val groupID = config.getString("config.GROUP_ID")
    val autoCommit = config.getString("config.AUTO_COMMIT")
    val maxPollRecords = config.getString("config.MAX_POLL_RECORDS")
    val maxPollIntervalms = config.getString("config.MAX_POLL_INTERVAL_MS")

    val props = Map(
      "bootstrap.servers" -> brokers,
      "zookeeper.connect" -> zookeeper,
      "group.id" -> groupID,
      "key.deserializer" -> deserializer,
      "value.deserializer" -> deserializer,
      "enable.auto.commit" -> autoCommit,
      "auto.offset.reset" -> offsetReset,
      "max.poll.records" -> maxPollRecords,
      "max.poll.interval.ms" -> maxPollIntervalms)
    props
  }

}

0 Answers:

No answers yet