I am trying to assemble a Spark application using sbt 1.0.4 and sbt-assembly 0.14.6.
The Spark application works fine when launched from IntelliJ IDEA or via spark-submit,
but if I run the assembled uber-jar from the command line (cmd on Windows 10):
java -Xmx1024m -jar my-app.jar
I get the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
The Spark application looks as follows:
package spark.main

import java.util.Properties
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]) {
    val connectionProperties = new Properties()
    connectionProperties.put("user", "postgres")
    connectionProperties.put("password", "postgres")
    connectionProperties.put("driver", "org.postgresql.Driver")

    val testTable = "test_tbl"

    val spark = SparkSession.builder()
      .appName("Postgres Test")
      .master("local[*]")
      .config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
      .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
      .getOrCreate()

    val dfPg = spark.sqlContext.read
      .jdbc("jdbc:postgresql://localhost/testdb", testTable, connectionProperties)

    dfPg.show()
  }
}
Below is the build.sbt:
name := "apache-spark-scala"

version := "0.1-SNAPSHOT"

scalaVersion := "2.11.8"

mainClass in Compile := Some("spark.main.Main")

libraryDependencies ++= {
  val sparkVer = "2.1.1"
  val postgreVer = "42.0.0"
  val cassandraConVer = "2.0.2"
  val configVer = "1.3.1"
  val logbackVer = "1.7.25"
  val loggingVer = "3.7.2"
  val commonsCodecVer = "1.10"
  Seq(
    "org.apache.spark" %% "spark-sql" % sparkVer,
    "org.apache.spark" %% "spark-core" % sparkVer,
    "com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
    "org.postgresql" % "postgresql" % postgreVer,
    "com.typesafe" % "config" % configVer,
    "commons-codec" % "commons-codec" % commonsCodecVer,
    "com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
    "org.slf4j" % "slf4j-api" % logbackVer
  )
}

dependencyOverrides ++= Seq(
  "io.netty" % "netty-all" % "4.0.42.Final",
  "commons-net" % "commons-net" % "2.2",
  "com.google.guava" % "guava" % "14.0.1"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Does anyone have any idea why this happens?
[UPDATE]
The configuration taken from the official GitHub repository solves the problem (its "services" case keeps the META-INF/services registrations instead of discarding them):
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case "services" :: _ => MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
Answer 0 (score: 2)
The question is almost a duplicate of Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?, except that the other OP used Apache Maven to create the uber-jar, while here it is about sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (aka alias) of a data source, e.g.
"name": "angular-src",
"version": "0.0.0",
"license": "MIT",
"angular-cli": {},
"scripts": {
"ng": "ng",
"start": "ng serve",
"test": "ng test",
"pree2e": "webdriver-manager update --standalone false --gecko false",
"e2e": "protractor"
},
"private": true,
"dependencies": {
"@angular/animation": "^4.0.0-beta.8",
"@angular/common": "^2.3.1",
"@angular/compiler": "^2.3.1",
"@angular/core": "^2.3.1",
"@angular/forms": "^2.3.1",
"@angular/http": "^2.3.1",
"@angular/platform-browser": "^2.3.1",
"@angular/platform-browser-dynamic": "^2.3.1",
"@angular/router": "^3.3.1",
"angular2-flash-messages": "^2.0.5",
"core-js": "^2.4.1",
"rxjs": "^5.0.1",
"ts-helpers": "^1.1.1",
"zone.js": "^0.7.2"
},
"devDependencies": {
"@angular/compiler-cli": "^2.3.1",
"@types/jasmine": "2.5.38",
"@types/node": "^6.0.42",
"angular-cli": "1.0.0-beta.28.3",
"codelyzer": "~2.0.0-beta.1",
"jasmine-core": "2.5.2",
"jasmine-spec-reporter": "2.5.0",
"karma": "1.2.0",
"karma-chrome-launcher": "^2.0.0",
"karma-cli": "^1.0.1",
"karma-jasmine": "^1.0.2",
"karma-remap-istanbul": "^0.2.1",
"protractor": "~4.0.13",
"ts-node": "1.2.1",
"tslint": "^4.3.0",
"typescript": "~2.0.3"
}
}
jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister for it.
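As a side note, a quick way to see which aliases are actually registered on the runtime classpath is to enumerate the DataSourceRegister implementations yourself. This is a minimal sketch, not part of the original answer, assuming a Spark 2.x dependency is on the classpath:

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Print every data source alias whose META-INF/services registration is visible
// on the classpath; "jdbc" should be listed once the uber-jar is assembled correctly.
ServiceLoader.load(classOf[DataSourceRegister]).asScala
  .foreach(register => println(register.shortName()))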
For the jdbc alias, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry (there are others):
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
That is what ties the jdbc alias to the data source.
And you have excluded it from the uber-jar with the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Note the case PathList("META-INF", xs @ _*) line, which you simply MergeStrategy.discard. That is the root cause.
Just to check that the "infrastructure" is available, you could use the jdbc data source by its fully-qualified name (not the alias). Try this:
spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  load("jdbc:postgresql://localhost/testdb")
You will see other problems due to missing options such as url, but... we are digressing.
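For completeness, here is a hedged sketch of what that fully-qualified read could look like once the missing options are supplied; the option names are the standard Spark JDBC data source options, and the connection details are the ones from the question:

// Same read as in the question, but addressing the JDBC provider by its
// fully-qualified class name and passing the connection details as options.
val dfPg = spark.read
  .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "test_tbl")
  .option("user", "postgres")
  .option("password", "postgres")
  .option("driver", "org.postgresql.Driver")
  .load()

dfPg.show()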
A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that would create an uber-jar with all the data sources, including the jdbc data source).
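A minimal sketch of that idea in build.sbt; the explicit path match is one possible way to spell it, and the configuration shown in the [UPDATE] above (with its "services" case) achieves the same effect:

assemblyMergeStrategy in assembly := {
  // Concatenate the data source registrations instead of throwing them away
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" =>
    MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}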