java.lang.NoSuchFieldError: INSTANCE error in a Spark application

Date: 2017-07-11 12:13:36

Tags: java scala maven apache-spark

I am new to Spark and I am facing a very strange problem. I am trying to develop a Spark application that fetches JSON data from a web API and loads it into a Hive table. I have split my program into two parts:

1.) Access.java - connects to the web API and gets the JSON
2.) test.scala - parses the JSON and writes it to a Hive table (this is where Spark comes into the picture)

Here is my code:

1.) Access.java:

import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.simple.JSONArray;
import org.json.simple.parser.JSONParser;

public class Access {
    // Calls the web API and returns the parsed JSON array (null on failure)
    JSONArray getToes(){

        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("xxxxxx","xxxxxxx"));
        HttpClientContext localContext = HttpClientContext.create();
        localContext.setCredentialsProvider(credentialsProvider);

        HttpHost proxy = new HttpHost("xxxxxxxxxxxxxxxxxxx", 8080, "http");
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();

        HttpClient httpClient = HttpClients.custom().setDefaultCredentialsProvider(credentialsProvider).build();

        HttpGet toesGet = new HttpGet("https://api.riskrecon.com/v0.1/toes");

        toesGet.setConfig(config);
        toesGet.setHeader("Accept","Application/Json");
        toesGet.addHeader("Authorization","Bearer xxxxxxxx");

        try {
            HttpResponse toes = httpClient.execute(toesGet);
            System.out.println(toes.getStatusLine());
            //System.out.println(toes.getAllHeaders().toString());
            System.out.println(toes.getEntity().toString());

            if(toes.getStatusLine().getStatusCode() == 200) {
                JSONParser parser = new JSONParser();
                JSONArray arr = (JSONArray) parser.parse(EntityUtils.toString(toes.getEntity()));
                System.out.println(arr);
                return arr;
            }
        } catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }
}

2.) test.scala:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object test {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    val acc = new Access
    val arr = acc.getToes()
    print(arr)

    //System.setProperty("hadoop.home.dir", "path") //For running on windows
    val conf = new SparkConf().setAppName("RiskRecon")//.setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val hiveContext = new HiveContext(sc)
    val obj = arr.get(0)
    val rdd = sc.parallelize(Seq(arr.toString))
    val dataframe = hiveContext.read.json(rdd)

    println("Size of Json "+arr.size())
    println("Size of dataframe "+dataframe.count())
    println(dataframe.show())
    print(dataframe.getClass)
  }
}

I add all the dependencies through Maven. Here is my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Test_Scala_project</groupId>
  <artifactId>xxxxxxxx</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>TestJar</name>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
    <dependency>
      <groupId>com.googlecode.json-simple</groupId>
      <artifactId>json-simple</artifactId>
      <version>1.1.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>

Now, I can package the jar and spark-submit it on Windows without problems. But when I spark-submit it on Linux, I run into errors. Here is my command:

   spark-submit --verbose --master yarn --class test app.jar 

It gives me an error saying:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext

Then I added the required jars and tried running it again:

spark-submit --verbose --master yarn --class test --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar

Now I get this strange error:

Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
        at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:966)
        at Access.getToes(Access.java:29)
        at test$.main(test.scala:9)
        at test.main(test.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have searched for similar errors, but nothing I found involves Spark. I did find a couple of related links though:

http://apache-spark-developers-list.1001551.n3.nabble.com/Dependency-hell-in-Spark-applications-td8264.html

NoSuchMethodError while running AWS S3 client on Spark while javap shows otherwise

The error looks like a dependency conflict problem. In the first link above, it is mentioned that Spark internally uses httpclient 4.1.2. Two things about that:

1.) If Spark ships with a default httpclient library, why did it give me a 'ClassNotFoundException' when I ran the application without adding the http jars?

2.) I tried including version 4.1.2 of httpclient and httpcore in the command and running it again. Here is the command:

spark-submit --verbose --master yarn --class test --jars httpclient-4.1.2.jar,httpcore-4.1.2.jar app.jar

It again gives the NoClassDefFoundError:

java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext

This is very strange. It gives me different errors for different versions of the libraries. I have also tried changing the http versions in my pom.xml. I even tried removing the http dependencies from the pom.xml altogether, since the Spark libraries have them internally (as far as I can tell), but it still gives me the same error.

I have also tried packaging Access.java as a standalone application and running it with the java -jar command. It runs fine without any errors. The problem only appears when the Spark libraries are involved.

I also tried packaging the application as an uber jar and running it. Still the same error.

What is causing this problem? Does Spark use some other version of httpcore and httpclient (besides the ones I have tried)? What would be the best solution here?
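To figure out which httpclient/httpcore classes are actually picked up at runtime, a minimal diagnostic sketch like the one below could be submitted with the same spark-submit command. This is only an idea, not part of my application (the object name is made up; the class names are real Apache HttpComponents classes). It prints which jar each class is loaded from:

object ClasspathCheck {
  def main(args: Array[String]): Unit = {
    // Print the jar each class is loaded from, to see whether the cluster's
    // bundled httpclient/httpcore shadows the versions passed via --jars.
    Seq(
      "org.apache.http.impl.client.HttpClientBuilder",
      "org.apache.http.conn.ssl.SSLConnectionSocketFactory",
      "org.apache.http.protocol.HttpContext"
    ).foreach { name =>
      val src = Class.forName(name).getProtectionDomain.getCodeSource
      println(name + " -> " + (if (src != null) src.getLocation else "unknown (bootstrap)"))
    }
  }
}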

Right now, the only thing I can think of is splitting the application into two parts: one that handles the JSON and saves it as a text file, and a separate Spark application that populates the Hive table. I don't think packaging a custom Spark jar with the required versions of the http components will work for me, because I will be running this on a cluster and cannot change its default libraries.

FIX:

I tried the maven-shade-plugin that cricket_007 pointed out in his comment. I added the following lines to my pom.xml:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>org.shaded.apache.http</shadedPattern>
          </relocation>
        </relocations>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>

The program now runs without any errors! Hope this helps someone else.
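A note on why this works, as far as I understand it: the relocation rewrites every org.apache.http reference in my own classes and in the bundled httpclient/httpcore to org.shaded.apache.http, so my code no longer mixes with the older httpclient that Hadoop/Spark put on the cluster classpath (that older version is missing the INSTANCE fields that the 4.5.2 SSLConnectionSocketFactory static initializer expects, which seems to be what triggered the NoSuchFieldError). Also, since shadedArtifactAttached is true, the shaded jar is produced alongside the normal artifact with a -shaded classifier, and that is the jar to pass to spark-submit. As a hypothetical sanity check (not part of my application), the relocated class name should resolve from inside the shaded jar:

// Hypothetical check: after relocation the http classes live under org.shaded.apache.http,
// so this should print the location of the shaded application jar.
println(Class.forName("org.shaded.apache.http.impl.client.HttpClientBuilder")
  .getProtectionDomain.getCodeSource.getLocation)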

0 Answers:

There are no answers yet.