Structured Streaming 2.1.0 Kafka driver works with --packages on YARN but fails in standalone cluster mode

Asked: 2017-01-26 23:52:29

Tags: apache-spark spark-structured-streaming

We are currently testing the Structured Streaming Kafka driver. Submitting on YARN (2.7.3) with --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' works without issue. However, when we try to launch on Spark standalone with deploy mode = cluster, we get a

ClassNotFoundException: Failed to find data source: kafka

error, even though the launch command added the Kafka jars to -Dspark.jars (see below), and subsequent log messages indicate those jars were added successfully.

All 10 jars are present in /home/spark/.ivy2 on every node. I manually verified that the KafkaSourceProvider class exists in org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar. I further confirmed the jars themselves are fine by launching the driver on YARN without the --packages option and adding all 10 jars manually via the --jars option. The nodes run Scala 2.11.8.

Any insight is appreciated.
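For reference, a sketch of the failing submit command, reconstructed from the details above (the standalone master URL and the application jar path are placeholders; the main class is taken from the stack trace below):

```shell
# Placeholder master URL and application jar; --packages resolves the
# Kafka source and its transitive dependencies into ~/.ivy2/jars.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  /path/to/app.jar
```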

  1. Jars added automatically by spark-submit:

    -Dspark.jars=file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar,file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar,file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar,file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar,file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar,file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar,file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar,file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar
    
  2. Spark INFO messages indicating the jars were loaded:

    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar with timestamp 1485467844922
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar at spark://10.102.22.23:50513/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar at spark://10.102.22.23:50513/jars/org.spark-project.spark_unused-1.0.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar at spark://10.102.22.23:50513/jars/net.jpountz.lz4_lz4-1.3.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar at spark://10.102.22.23:50513/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar at spark://10.102.22.23:50513/jars/org.slf4j_slf4j-api-1.7.16.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar at spark://10.102.22.23:50513/jars/org.scalatest_scalatest_2.11-2.2.6.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar at spark://10.102.22.23:50513/jars/org.scala-lang_scala-reflect-2.11.8.jar with timestamp 1485467844924
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar at spark://10.102.22.23:50513/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar with timestamp 1485467844924
    
  3. Error message:

    Caused by: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:197)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
        at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
        at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount$.main(StructuredStreamingSignalCount.scala:76)
        at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount.main(StructuredStreamingSignalCount.scala)
        ... 6 more
    Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
    

1 Answer:

Answer 0 (score: 1)

This is a known issue. See https://issues.apache.org/jira/browse/SPARK-4160

For now, you can use client mode as a workaround.
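For example, resubmitting with the client deploy mode (the master URL and application jar path are placeholders; the main class comes from the question's stack trace):

```shell
# Workaround: in client deploy mode the driver runs on the submitting
# machine, where the --packages jars are resolved, so the Kafka data
# source is visible to the driver's classloader.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  /path/to/app.jar
```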