I am trying to add a JSONSerDe jar file so that I can access JSON data and load JSON data from a Spark job into a Hive table. My code looks like this:
SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10));
final SQLContext sqlContext = new SQLContext(sc);
final HiveContext hiveContext = new HiveContext(sc);
// Register the SerDe jar from HDFS, then load the JSON output into the Hive table.
hiveContext.sql("ADD JAR hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");
hiveContext.sql("LOAD DATA INPATH '/tmp/mar08/part-00000' OVERWRITE INTO TABLE testjson");
But I end up with the following error:
java.net.MalformedURLException: unknown protocol: hdfs
at java.net.URL.<init>(URL.java:592)
at java.net.URL.<init>(URL.java:482)
at java.net.URL.<init>(URL.java:431)
at java.net.URI.toURL(URI.java:1096)
at org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:148)
at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:141)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
I am able to add the jar through the Hive shell. But when I try to add it with hiveContext.sql() inside the Spark job (Java code), it throws the error above. Quick help would be greatly appreciated.
Thanks.
Answer 0 (score: 3)
You can pass the UDF jar at runtime by passing --jars to the spark-submit command, or you can copy the required jars into the Spark libs directory.
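For example, a submission along these lines (the application jar name and local path are placeholders, not taken from the original post):

spark-submit \
  --class com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase \
  --jars /path/to/hive-serdes-1.0-SNAPSHOT.jar \
  kafka-stream-to-hbase.jar

With --jars, Spark distributes the jar and puts it on the classpath for you, so you may not need the ADD JAR statement at all.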
Basically it supports the file, hdfs and ivy schemes.
Which version of Spark are you using? I don't see an addJar method in the latest version of ClientWrapper.scala.
Answer 1 (score: 1)
I just looked at the Spark code. It seems this is an issue on the Spark side. The path is converted to a plain java.net.URL to resolve the scheme, and the Java URL class does not know the hdfs protocol. Ideally they would register FsUrlStreamHandlerFactory.java (i.e. https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsUrlStreamHandlerFactory.java) with the URL class.
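As a rough illustration of that registration (the class name below is mine, and whether this actually takes effect for HiveContext.addJar depends on when Spark's own code runs, so treat it as a sketch rather than a confirmed fix):

import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public final class HdfsUrlHandlerSetup {
    public static void main(String[] args) throws Exception {
        // setURLStreamHandlerFactory may be called at most once per JVM,
        // so do this early in the driver, before anything else sets a factory.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        // Once registered, java.net.URL can parse hdfs:// URLs instead of
        // throwing "MalformedURLException: unknown protocol: hdfs".
        URL jar = new URL("hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");
        System.out.println(jar.getProtocol()); // prints "hdfs"
    }
}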
You can add the jar from the local file system, pass the jars at job submission time, or copy the jar into the Spark lib folder.
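For the first option, a minimal sketch (the local path is a placeholder; copy the jar out of HDFS to the driver's disk first, e.g. with hdfs dfs -get):

// Assumes the SerDe jar has already been copied to this local path on the driver.
// A scheme-less local path avoids the hdfs protocol lookup that fails here.
hiveContext.sql("ADD JAR /tmp/hive-serdes-1.0-SNAPSHOT.jar");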