I am new to Spark and I am facing a very strange problem. I am trying to develop a Spark application that fetches JSON data from a web API and loads it into a Hive table. I have split my program into two parts:
1.) Access.java - connects to the web API and gets the JSON
2.) test.scala - parses the JSON and writes it to a Hive table (this is where Spark comes into the picture)
Here is my code:
1.) Access.java:
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.simple.JSONArray;
import org.json.simple.parser.JSONParser;

public class Access {
    JSONArray getToes() {
        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("xxxxxx", "xxxxxxx"));
        HttpClientContext localContext = HttpClientContext.create();
        localContext.setCredentialsProvider(credentialsProvider);
        HttpHost proxy = new HttpHost("xxxxxxxxxxxxxxxxxxx", 8080, "http");
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        HttpClient httpClient = HttpClients.custom().setDefaultCredentialsProvider(credentialsProvider).build();
        HttpGet toesGet = new HttpGet("https://api.riskrecon.com/v0.1/toes");
        toesGet.setConfig(config);
        toesGet.setHeader("Accept", "Application/Json");
        toesGet.addHeader("Authorization", "Bearer xxxxxxxx");
        try {
            HttpResponse toes = httpClient.execute(toesGet);
            System.out.println(toes.getStatusLine());
            //System.out.println(toes.getAllHeaders().toString());
            System.out.println(toes.getEntity().toString());
            if (toes.getStatusLine().getStatusCode() == 200) {
                JSONParser parser = new JSONParser();
                JSONArray arr = (JSONArray) parser.parse(EntityUtils.toString(toes.getEntity()));
                System.out.println(arr);
                return arr;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
2.) test.scala:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object test {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    val acc = new Access
    val arr = acc.getToes()
    print(arr)
    //System.setProperty("hadoop.home.dir", "path") // for running on Windows
    val conf = new SparkConf().setAppName("RiskRecon") //.setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val hiveContext = new HiveContext(sc)
    val obj = arr.get(0)
    val rdd = sc.parallelize(Seq(arr.toString))
    val dataframe = hiveContext.read.json(rdd)
    println("Size of Json " + arr.size())
    println("Size of dataframe " + dataframe.count())
    println(dataframe.show())
    print(dataframe.getClass)
  }
}
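(The actual write into the Hive table is not in test.scala yet; once the DataFrame looks right, what I have in mind is roughly the snippet below. The table name is just a placeholder.)
// Rough sketch, not part of my current code: persist the parsed DataFrame into Hive.
// "riskrecon_toes" is a placeholder table name.
import org.apache.spark.sql.SaveMode
dataframe.write.mode(SaveMode.Overwrite).saveAsTable("riskrecon_toes")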
I am adding all the dependencies through Maven. Here is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Test_Scala_project</groupId>
    <artifactId>xxxxxxxx</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>TestJar</name>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
        <dependency>
            <groupId>com.googlecode.json-simple</groupId>
            <artifactId>json-simple</artifactId>
            <version>1.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>
Now, I am able to package the jar and run it with spark-submit on Windows without any issues. But when I spark-submit on Linux, I run into errors. Here is my command:
spark-submit --verbose --master yarn --class test app.jar
It gives me an error saying:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext
Then I add the required jars and try running it again:
spark-submit --verbose --master yarn --class test --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar
Now, I get this strange error:
Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:966)
at Access.getToes(Access.java:29)
at test$.main(test.scala:9)
at test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have searched for similar errors, but found nothing involving Spark. However, I did find a couple of similar links:
NoSuchMethodError while running AWS S3 client on Spark while javap shows otherwise
The error looks like a dependency mismatch issue. In the first link above, it is mentioned that Spark uses httpclient 4.1.2 internally. Two things come to mind:
1.) If Spark ships a default httpclient library, why does it give me a ClassNotFoundException when I run the application without adding the http jars?
2.) I tried including version 4.1.2 of httpclient and httpcore in the command and running it again. Here is the command:
spark-submit --verbose --master yarn --class test --jars httpclient-4.1.2.jar,httpcore-4.1.2.jar app.jar
It again gives the ClassNotFound error:
java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext
This is very strange. It gives me different errors with different versions of the libraries. I have also tried changing the http versions in my pom.xml, and I have even tried removing the http dependencies from the pom.xml altogether, since the Spark libraries carry them internally (as far as I know so far), but it still gives me the same error.
I also tried packaging Access.java as a standalone application and running it with the java -jar command. It runs fine without any errors. The problem only shows up when the Spark libraries are involved.
I also tried packaging the application as an uber jar and running it. Still the same error.
What is causing this problem? Does Spark use some other version of httpcore and httpclient (besides the ones I have already tried)? And what would be the best solution here?
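(Side note for anyone debugging the same thing: to check which httpclient/httpcore version actually ends up on the driver classpath, something like the snippet below should work from inside main(). This is just a diagnostic idea, not part of my application; the class names are the ones from the stack traces above.)
// Diagnostic sketch: print which jar each httpclient class is actually loaded from,
// which should reveal the version that Spark/Hadoop puts on the classpath.
Seq("org.apache.http.client.protocol.HttpClientContext",
    "org.apache.http.conn.ssl.SSLConnectionSocketFactory").foreach { name =>
  try {
    val location = Class.forName(name).getProtectionDomain.getCodeSource.getLocation
    println(s"$name -> $location")
  } catch {
    case t: Throwable => println(s"$name -> not loadable: $t")
  }
}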
Right now, the only workaround I can think of is to split the application into two parts: one that handles the JSON and saves it to a text file, and a separate Spark application that populates the Hive table. I don't think building a custom Spark jar with the required versions of the httpcomponents would work for me, since I will be running this on a cluster and cannot change its default libraries.
FIX:
I tried the maven-shade-plugin that cricket_007 pointed out in his comment. I added the following lines to my pom.xml:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <relocation>
                        <pattern>org.apache.http</pattern>
                        <shadedPattern>org.shaded.apache.http</shadedPattern>
                    </relocation>
                </relocations>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>shaded</shadedClassifierName>
            </configuration>
        </execution>
    </executions>
</plugin>
The program now runs without any errors! Hope this helps someone else.
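For anyone trying the same fix: since shadedArtifactAttached is true with the shaded classifier, the shaded jar is produced next to the normal one as <artifactId>-<version>-shaded.jar, and the relocation rewrites the org.apache.http references inside my classes to org.shaded.apache.http, so they no longer collide with the older httpclient that the cluster ships. The spark-submit command then points at the shaded artifact, something like:
spark-submit --verbose --master yarn --class test xxxxxxxx-0.0.1-SNAPSHOT-shaded.jar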