In short: with Spark 2.0, spark-submit prefers its own version of the Guava library (14.0.1), but I would like to use the most recent version of the jar (19.0).
The question: how can I convince Spark to use the version provided in my pom.xml file?
My suspicion: I could use the spark.driver.userClassPathFirst=true option, but it is marked as an experimental feature (Spark 2.0.0 doc), so perhaps there is a better solution?
Detailed description of the problem:
I am using Spark 2.0.0 (hadoop2.7) and Elasticsearch 2.3.4. I am experimenting with a very simple application that tries to use Spark Streaming and Elasticsearch together. Here it is:
SparkConf sparkConf = new SparkConf().setAppName("SampleApp");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(500));
jssc.checkpoint("/tmp");

JavaDStream<String> messages = jssc.textFileStream("/some_directory_path");

TransportClient client = TransportClient.builder().build()
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

messages.foreachRDD(rdd -> {
    XContentBuilder builder = jsonBuilder()
            .startObject()
            .field("words", "some words")
            .endObject();
    client.prepareIndex("indexName", "typeName")
            .setSource(builder.string())
            .get();
});

jssc.start();
jssc.awaitTermination();
The project is built with Maven. Here is the relevant part of the pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.0.0</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.3.4</version>
        <exclusions>
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.6.2</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>19.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <createSourcesJar>true</createSourcesJar>
                        <transformers>
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.abc.App</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
As you can see, I added some exclusions. Everything looks fine, but after running the application with:
spark-submit --class com.abc.App --master local[2] /somePath/superApp-0.0.1-SNAPSHOT.jar
I get the following exception (MoreExecutors.directExecutor() only appeared in Guava 18, so the Guava 14.0.1 that Spark puts on the classpath cannot satisfy Elasticsearch's call):
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
at org.elasticsearch.threadpool.ThreadPool.<clinit>(ThreadPool.java:190)
at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
at com.abc.App.main(App.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
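To confirm which jar the offending class is actually loaded from, a minimal check can be dropped into the application. This is just a diagnostic sketch (the class name GuavaCheck is an arbitrary placeholder); it prints whatever jar the JVM resolved MoreExecutors from:

import com.google.common.util.concurrent.MoreExecutors;

public class GuavaCheck {
    public static void main(String[] args) {
        // Print the jar from which the JVM actually loaded Guava's MoreExecutors.
        // Launched via spark-submit without userClassPathFirst, I would expect this to
        // point at Spark's bundled Guava rather than the 19.0 copy inside the uber jar.
        System.out.println(
                MoreExecutors.class.getProtectionDomain().getCodeSource().getLocation());
    }
}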
After adding --conf spark.driver.userClassPathFirst=true to the spark-submit command, the application seems to work correctly. But I am not sure this is the right way to handle the problem, because that option is marked as experimental in the documentation.
In other words, Spark prefers the libraries that come with its runtime environment and ignores the ones shipped in the "uber" (assembly) jar. So I would like to know what the proper way to change this behavior is.
To ask the question once more: what do I have to do to make sure that a specific jar (a specific version) defined in my POM is the one actually used at runtime?
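One workaround I have come across (but cannot judge whether it is the "proper" way) is to stop competing for classpath order at all and instead relocate the conflicting packages inside the uber jar with the shade plugin. A rough sketch of how the <configuration> element above might be extended (the com.abc.shaded prefix is just an arbitrary name I picked):

<configuration>
    <createSourcesJar>true</createSourcesJar>
    <relocations>
        <!-- Rewrite the bundled Guava 19.0 classes (and my references to them) into a
             private package, so they can no longer clash with whatever Guava version
             Spark puts on the classpath. -->
        <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>com.abc.shaded.com.google.common</shadedPattern>
        </relocation>
    </relocations>
    <transformers>
        <transformer
            implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
        <transformer
            implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.abc.App</mainClass>
        </transformer>
    </transformers>
</configuration>

If I understand the shade plugin correctly, relocation rewrites both the bundled Guava classes and the bytecode that references them, so the conflict disappears without touching Spark's own classpath. I would still like to know whether this is considered the correct approach.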
EDIT: In a more complex application that also tries to use Spark Streaming and Elasticsearch together, the same problem appears with other libraries as well (e.g. io.netty:netty). In that case, simply activating the spark.driver.userClassPathFirst option does not help at all.
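If the relocation idea above is valid, I assume it has to be repeated for every clashing library. For the Netty conflict that would presumably mean relocating org.jboss.netty as well (if I read the dependency tree correctly, the io.netty:netty artifact is Netty 3.x and still ships its classes under that package), for example:

<relocations>
    <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>com.abc.shaded.com.google.common</shadedPattern>
    </relocation>
    <!-- io.netty:netty is Netty 3.x; its classes live under org.jboss.netty. -->
    <relocation>
        <pattern>org.jboss.netty</pattern>
        <shadedPattern>com.abc.shaded.org.jboss.netty</shadedPattern>
    </relocation>
</relocations>

I have not verified that Netty tolerates being relocated, though, so this part is pure speculation on my side.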