How to build a truly local Apache Spark "fat" jar? JRE memory issue?

Asked: 2017-02-28 06:12:00

Tags: java scala apache-spark sbt sbt-assembly

Long story short: I have an application that uses Spark DataFrames and machine learning, with ScalaFX for the front end. I want to create one huge "fat" jar so that it can run on any machine with a JVM.

I am familiar with the sbt-assembly plugin and have spent several hours researching how to assemble the jar. Here is my build.sbt:

lazy val root = (project in file(".")).
  settings(
    scalaVersion := "2.11.8",
    mainClass in assembly := Some("me.projects.MyProject.Main"),
    assemblyJarName in assembly := "MyProject_2.0.jar",
    test in assembly := {}
  )

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" withSources() withJavadoc()
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" withSources() withJavadoc()
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.0.2" withSources() withJavadoc()
libraryDependencies += "joda-time" % "joda-time" % "2.9.4" withJavadoc()
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.1" % "provided"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
libraryDependencies += "org.scalafx" %% "scalafx" % "8.0.92-R10" withSources() withJavadoc()
libraryDependencies += "net.liftweb" %% "lift-json" % "2.6+" withSources() withJavadoc()

EclipseKeys.withSource := true
EclipseKeys.withJavadoc := true

// META-INF discarding
assemblyMergeStrategy in assembly := {
  case PathList("org","aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
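
(For reference, sbt-assembly itself is enabled in project/plugins.sbt, which is not shown above. A minimal sketch follows; the exact version number is an assumption, and any 0.14.x release current at the time should behave the same way.)

// project/plugins.sbt -- the version number here is an assumption
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")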

This works fine on my Linux machine, which has Spark installed and configured. I have assembled jars that use ScalaFX before and opened them on a Windows machine without any problems. This application, however, also uses Spark, and it fails with the following:

ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase the heap size using the --driver-memory option or spark.driver.memory in Spark configuration.

Things I have tried:

  • Including/excluding % "provided" on the Spark dependencies in build.sbt
  • Adding larger and larger values to -Xms, both in the runtime parameters and in the Java Runtime Environment settings on the Windows machine.
  • Setting different values for spark.executor/driver memory when creating the SparkConf (in the Scala code), like this:

    .set("spark.executor.memory", "12g")
    .set("spark.executor.driver", "5g")
    .set("spark.driver.memory", "5g")

The application itself works correctly (when run from Scala IDE, when run with spark-submit, and when the assembled jar is opened on Linux).

Please let me know whether this is possible. It is a small project that runs a few machine learning operations on some data (Spark) behind a GUI (ScalaFX), hence the dependencies above.

Again, I am not planning to set up a cluster or anything like that. I just want to use Spark functionality by running the jar on any computer with a JRE. It is a small project meant to be demonstrated.

2 answers:

Answer 0 (score: 0)

Try using .set("spark.driver.memory", "5g") when declaring the SparkConf. That will only help, of course, if the machine actually has more than 5 GB of memory.
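
For concreteness, here is a minimal sketch of where that setting would go when building a local-mode session with the Spark 2.x API. The app name and master are illustrative, not taken from the question; note also that in local/client mode the driver JVM has usually already started by the time the SparkConf is evaluated, so setting spark.driver.memory programmatically may have no effect.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal local-mode setup; "MyProject" and local[*] are illustrative.
val conf = new SparkConf()
  .setAppName("MyProject")
  .setMaster("local[*]")
  .set("spark.driver.memory", "5g") // may be ignored once the driver JVM is already running

val spark = SparkSession.builder().config(conf).getOrCreate()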

Answer 1 (score: 0)

It turned out to be a fairly generic JVM issue. Instead of adding runtime parameters, I solved it by adding a new environment variable on the Windows system:

Name: _JAVA_OPTIONS
Value: -Xms512M -Xmx1024M
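
As a quick sanity check (not part of the original answer), the heap actually granted to the jar's JVM can be printed with the standard Runtime API; since _JAVA_OPTIONS is picked up by every JVM started on the machine, the assembled jar should now report roughly the -Xmx value:

object HeapCheck {
  def main(args: Array[String]): Unit = {
    // The SparkContext error above asks for at least 471859200 bytes (450 MB),
    // which -Xmx1024M comfortably satisfies.
    val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"Max heap available to this JVM: $maxHeapMb MB")
  }
}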