In short: with Spark 2.0, spark-submit prefers its own version of the Guava library (14.0.1), but I would like to use the most recent version of the jar (19.0).
The question: how can I convince Spark to use the version provided in my pom.xml file?
My suspicion: I could use the spark.driver.userClassPathFirst=true option, but it is marked as an experimental feature (Spark 2.0.0 doc), so perhaps there is a better solution?
Detailed description of the problem:
I am using Spark 2.0.0 (hadoop2.7) and Elasticsearch 2.3.4. I am experimenting with a very simple application that tries to use Spark Streaming and Elasticsearch together. Here it is:
SparkConf sparkConf = new SparkConf().setAppName("SampleApp");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(500));
jssc.checkpoint("/tmp");

JavaDStream<String> messages = jssc.textFileStream("/some_directory_path");

TransportClient client = TransportClient.builder().build()
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

messages.foreachRDD(rdd -> {
    XContentBuilder builder = jsonBuilder()
            .startObject()
            .field("words", "some words")
            .endObject();
    client.prepareIndex("indexName", "typeName")
            .setSource(builder.string())
            .get();
});

jssc.start();
jssc.awaitTermination();
The project is built with Maven. Here is the relevant part of the pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.0.0</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.3.4</version>
        <exclusions>
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.6.2</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>19.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <createSourcesJar>true</createSourcesJar>
                        <transformers>
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.abc.App</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
As you can see, I added some exclusions. Everything looks fine, but after running the application with:
spark-submit --class com.abc.App --master local[2] /somePath/superApp-0.0.1-SNAPSHOT.jar
I get the following exception (MoreExecutors.directExecutor() only appeared in Guava 18, so the Guava 14.0.1 that Spark puts on the classpath cannot satisfy Elasticsearch's call):
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
at org.elasticsearch.threadpool.ThreadPool.<clinit>(ThreadPool.java:190)
at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
at com.abc.App.main(App.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
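To confirm which jar the offending class is actually loaded from, a minimal check can be dropped into the application. This is just a diagnostic sketch (the class name GuavaCheck is an arbitrary placeholder); it prints whatever jar the JVM resolved MoreExecutors from:

import com.google.common.util.concurrent.MoreExecutors;

public class GuavaCheck {
    public static void main(String[] args) {
        // Print the jar from which the JVM actually loaded Guava's MoreExecutors.
        // Launched via spark-submit without userClassPathFirst, I would expect this to
        // point at Spark's bundled Guava rather than the 19.0 copy inside the uber jar.
        System.out.println(
                MoreExecutors.class.getProtectionDomain().getCodeSource().getLocation());
    }
}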
After adding --conf spark.driver.userClassPathFirst=true to the spark-submit command, the application seems to work correctly. But I am not sure this is the right way to handle the problem, because that option is marked as experimental in the documentation.
In other words, Spark prefers the libraries that come with its runtime environment and ignores the ones shipped in the "uber" (assembly) jar. So I would like to know what the proper way to change this behavior is.
To ask the question once more: what do I have to do to make sure that a specific jar (a specific version) defined in my POM is the one actually used at runtime?
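One workaround I have come across (but cannot judge whether it is the "proper" way) is to stop competing for classpath order at all and instead relocate the conflicting packages inside the uber jar with the shade plugin. A rough sketch of how the <configuration> element above might be extended (the com.abc.shaded prefix is just an arbitrary name I picked):

<configuration>
    <createSourcesJar>true</createSourcesJar>
    <relocations>
        <!-- Rewrite the bundled Guava 19.0 classes (and my references to them) into a
             private package, so they can no longer clash with whatever Guava version
             Spark puts on the classpath. -->
        <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>com.abc.shaded.com.google.common</shadedPattern>
        </relocation>
    </relocations>
    <transformers>
        <transformer
            implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
        <transformer
            implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.abc.App</mainClass>
        </transformer>
    </transformers>
</configuration>

If I understand the shade plugin correctly, relocation rewrites both the bundled Guava classes and the bytecode that references them, so the conflict disappears without touching Spark's own classpath. I would still like to know whether this is considered the correct approach.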
EDIT: In a more complex application that also tries to use Spark Streaming and Elasticsearch together, the same problem appears with other libraries as well (e.g. io.netty:netty). In that case, simply activating the spark.driver.userClassPathFirst option does not help at all.
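If the relocation idea above is valid, I assume it has to be repeated for every clashing library. For the Netty conflict that would presumably mean relocating org.jboss.netty as well (if I read the dependency tree correctly, the io.netty:netty artifact is Netty 3.x and still ships its classes under that package), for example:

<relocations>
    <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>com.abc.shaded.com.google.common</shadedPattern>
    </relocation>
    <!-- io.netty:netty is Netty 3.x; its classes live under org.jboss.netty. -->
    <relocation>
        <pattern>org.jboss.netty</pattern>
        <shadedPattern>com.abc.shaded.org.jboss.netty</shadedPattern>
    </relocation>
</relocations>

I have not verified that Netty tolerates being relocated, though, so this part is pure speculation on my side.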