Why does a Spark application fail with "ClassNotFoundException: Failed to find data source: jdbc" as an uber-jar built with sbt assembly?

Time: 2017-12-21 11:00:23

Tags: scala apache-spark sbt apache-spark-sql sbt-assembly

I am trying to assemble a Spark application with sbt 1.0.4 and sbt-assembly 0.14.6.

The Spark application works fine when launched from IntelliJ IDEA or via spark-submit, but if I run the assembled uber-jar from the command line (cmd in Windows 10):

java -Xmx1024m -jar my-app.jar

I get the following exception:


Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html

The Spark application looks as follows:

package spark.main

import java.util.Properties    
import org.apache.spark.sql.SparkSession

object Main {

    def main(args: Array[String]) {
        val connectionProperties = new Properties()
        connectionProperties.put("user","postgres")
        connectionProperties.put("password","postgres")
        connectionProperties.put("driver", "org.postgresql.Driver")

        val testTable = "test_tbl"

        val spark = SparkSession.builder()
            .appName("Postgres Test")
            .master("local[*]")
            .config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
            .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
            .getOrCreate()

        val dfPg = spark.sqlContext.read.
            jdbc("jdbc:postgresql://localhost/testdb",testTable,connectionProperties)

        dfPg.show()
    }
}

Here is build.sbt:

name := "apache-spark-scala"

version := "0.1-SNAPSHOT"

scalaVersion := "2.11.8"

mainClass in Compile := Some("spark.main.Main")

libraryDependencies ++= {
    val sparkVer = "2.1.1"
    val postgreVer = "42.0.0"
    val cassandraConVer = "2.0.2"
    val configVer = "1.3.1"
    val logbackVer = "1.7.25"
    val loggingVer = "3.7.2"
    val commonsCodecVer = "1.10"
    Seq(
        "org.apache.spark" %% "spark-sql" % sparkVer,
        "org.apache.spark" %% "spark-core" % sparkVer,
        "com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
        "org.postgresql" % "postgresql" % postgreVer,
        "com.typesafe" % "config" % configVer,
        "commons-codec" % "commons-codec" % commonsCodecVer,
        "com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
        "org.slf4j" % "slf4j-api" % logbackVer
    )
}

dependencyOverrides ++= Seq(
    "io.netty" % "netty-all" % "4.0.42.Final",
    "commons-net" % "commons-net" % "2.2",
    "com.google.guava" % "guava" % "14.0.1"
)

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

Does anyone have any idea why?

[UPDATE]

The configuration taken from the official GitHub repository solved the problem:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
          MergeStrategy.discard
      case "services" :: _ =>  MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
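
To double-check that the assembled jar still carries the service registrations, one quick way is to open it and print the merged DataSourceRegister file. This is only a verification sketch; the jar path below is an assumption and should be adjusted to your actual sbt-assembly output:

import java.util.jar.JarFile
import scala.io.Source

object CheckUberJar extends App {
    // Assumed path to the assembled uber-jar; adjust to your own build output.
    val jar = new JarFile("target/scala-2.11/my-app.jar")
    val entry = jar.getEntry("META-INF/services/org.apache.spark.sql.sources.DataSourceRegister")
    if (entry == null)
        println("DataSourceRegister is missing from the uber-jar")
    else
        // Print every registered data source provider kept in the merged file.
        Source.fromInputStream(jar.getInputStream(entry)).getLines().foreach(println)
}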

1 Answer:

Answer 0 (score: 2):

The question is almost the same as Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?, with the difference that the other OP used Apache Maven to create the uber-jar, while here it is about sbt (the sbt-assembly plugin's configuration, to be precise).

The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available when the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister implementation.

For the jdbc alias to work, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry (there are others):

org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider

That is what ties the jdbc alias to the data source.

And you have excluded it from the uber-jar with the following assemblyMergeStrategy:

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

Note the case PathList("META-INF", xs @ _*), which you simply MergeStrategy.discard. That is the root cause.
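
For context, Spark SQL resolves a short name roughly the way java.util.ServiceLoader does: it loads every DataSourceRegister on the classpath and matches on shortName(). A minimal sketch of that lookup (not Spark's actual internal code) shows why a discarded META-INF/services file makes the alias disappear:

import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister
import scala.collection.JavaConverters._

object ListDataSources extends App {
    // Load all data sources registered via META-INF/services on the classpath.
    // If the services file was discarded during assembly, "jdbc" will not appear here.
    val registered = ServiceLoader.load(classOf[DataSourceRegister]).asScala
    registered.map(_.shortName()).foreach(println)
}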

Just to check that the "infrastructure" is available, you could use the jdbc data source by its fully-qualified name (not the alias). Try this:

spark.read.
    format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
    load("jdbc:postgresql://localhost/testdb")

You will see other problems due to missing options such as url, but... we are digressing.
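
As a side note, a sketch of what that fully-qualified read could look like once the JDBC options are supplied (the connection details here are assumptions for illustration only):

// Hypothetical connection details; substitute your own database, table and credentials.
val df = spark.read.
    format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
    option("url", "jdbc:postgresql://localhost/testdb").
    option("dbtable", "test_tbl").
    option("user", "postgres").
    option("password", "postgres").
    load()
df.show()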

The solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that creates an uber-jar with all data sources, including the jdbc data source):

case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
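
One way to wire that into the original strategy is to place the concat case before the generic META-INF discard, so the services file is kept while the rest of META-INF is still dropped. This is a sketch based on the build.sbt above, not the only possible arrangement:

assemblyMergeStrategy in assembly := {
    // Keep the data source registrations by concatenating all service files.
    case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" =>
        MergeStrategy.concat
    // Everything else under META-INF can still be discarded.
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}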