Are recompiled source classes in the Spark jars breaking the sbt assembly merge?

Date: 2017-10-19 15:21:59

Tags: apache-spark sbt sbt-assembly

Attempting to build a fat jar with sbt produces errors like the following:

java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\db\.ivy2\cache\org.apache.spark\spark-network-common_2.10\jars\spark-network-common_2.10-1.6.3.jar:com/google/common/base/Function.class
C:\Users\db\.ivy2\cache\com.google.guava\guava\bundles\guava-14.0.1.jar:com/google/common/base/Function.class
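
The question doesn't show the assembly setup itself; for context, a minimal sbt-assembly configuration along the following lines is assumed here (the plugin version is only an illustrative choice):

// project/plugins.sbt -- hypothetical minimal setup for building the fat jar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

Running sbt assembly against such a setup is what fails with the deduplicate error above.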

There are many such classes; this is just one example. Guava 14.0.1 is the version of Function.class in both jars:

[info]  +-com.google.guava:guava:14.0.1
...
[info]  | | +-com.google.guava:guava:14.0.1
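
The question doesn't say which tool printed this tree; output in this format can be produced with, for example, the sbt-dependency-graph plugin, whose version below is only an assumption:

// project/plugins.sbt -- optional, for inspecting the resolved dependency tree
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")
// then run dependencyTree in the sbt shell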

Both copies claim Guava 14.0.1, so sbt/Ivy has no newer version to pick as the winner; yet the size and date of the class differ between the two jars, which presumably triggers the error above:

$ jar tvf /c/Users/db/.ivy2/cache/org.apache.spark/spark-network-common_2.10/jars/spark-network-common_2.10-1.6.3.jar | grep "com/google/common/base/Function.class"
   549 Wed Nov 02 16:03:20 CDT 2016 com/google/common/base/Function.class

$ jar tvf /c/Users/db/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar  | grep "com/google/common/base/Function.class"
   543 Thu Mar 14 19:56:52 CDT 2013 com/google/common/base/Function.class

It looks as though Apache is recompiling Function.class from source rather than shipping the class as originally compiled. Is that a correct reading of what is happening here? The recompiled classes can be excluded with sbt, but is there a way to build the jar without explicitly excluding every jar that contains recompiled sources? Excluding the jars explicitly leads to something along the lines of the snippet below, which feels like it is taking me down the wrong path:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"
  excludeAll(
    ExclusionRule(organization = "com.twitter"),
    ExclusionRule(organization = "org.apache.spark", name = "spark-network-common_2.10"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-client"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-hdfs"),
    ExclusionRule(organization = "org.tachyonproject", name = "tachyon-client"),
    ExclusionRule(organization = "commons-beanutils", name = "commons-beanutils"),
    ExclusionRule(organization = "commons-collections", name = "commons-collections"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "org.apache.curator", name = "curator-recipes")
  )
,
libraryDependencies += "org.apache.spark" %% "spark-network-common" % "1.6.3" exclude("com.google.guava", "guava"),
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.3",
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" exclude("com.google.guava", "guava"),
libraryDependencies += "com.google.guava" % "guava" % "14.0.1",
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11",
libraryDependencies += "org.json4s" %% "json4s-ext" % "3.2.11",
libraryDependencies += "com.rabbitmq" % "amqp-client" % "4.1.1",
libraryDependencies += "commons-codec" % "commons-codec" % "1.10",

If this is the wrong path, what is the cleaner way?

1 Answer:

Answer 0 (score: 1)

If this is the wrong path, what is the cleaner way?

The cleaner way is not to package spark-core at all: it is available on the target machine once you install Spark there, and it is available to your application at runtime (you can usually find it under /usr/lib/spark/jars).

You should mark these Spark dependencies as % "provided". That helps you avoid many of the conflicts caused by packaging those jars.
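
A minimal sketch of what that could look like in build.sbt, reusing the versions from the question (the exact module list depends on the application):

// build.sbt -- Spark artifacts come from the cluster at runtime, so keep them out of the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.6.3" % "provided",
  // application-level libraries still go into the assembly
  "com.rabbitmq"  % "amqp-client"   % "4.1.1",
  "commons-codec" % "commons-codec" % "1.10"
)

With the Spark modules marked provided, sbt assembly leaves them (and the Guava classes they pull in) out of the fat jar, and spark-submit supplies the installation's own jars on the classpath at runtime. One caveat: provided dependencies are also absent from the sbt run classpath, so running the application locally through sbt requires adding them back.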