A simple
select * from table
query runs fine with Hive on Spark, but on joins and sums the ApplicationMaster returns the following stack trace for the associated Spark container:
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
)
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:724)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Deleting staging directory hdfs://LOSLDAP01:9000/user/hdfs/.sparkStaging/application_1553880018684_0001
2019-03-29 17:23:43 INFO ShutdownHookManager:54 - Shutdown hook called
I have already tried increasing the YARN container memory allocation (and reducing the Spark memory), without success.
Using: Hadoop 2.9.2, Spark 2.3.0, Hive 2.3.4
Thanks for your help.
Answer 0 (score: 1)
This was asked 6 months ago; hopefully this helps someone else. The error occurs because SPARK_RPC_SERVER_ADDRESS was added in Hive 2.x, while Spark is built against Hive 1.2.1 by default.
I was able to enable Hive-on-Spark on an EMR 5.25 cluster (Hadoop 2.8.5, Hive 2.3.5, Spark 2.4.3) running on YARN, following this manual. However, the manual needs updating: it is missing a few key items.
ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
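For completeness, the four symlink commands above can also be scripted. This is a minimal sketch; the link_spark_jars helper and the jar-version map are my own, mirroring the commands above, so adjust the versions to the jars actually shipped on your cluster:

```python
import os

# Map Hive-side link names to the Spark jars they should point at.
# Versions match the EMR 5.25 commands above; adjust for your install.
SPARK_JARS = {
    "scala-library.jar": "scala-library-2.11.12.jar",
    "spark-core.jar": "spark-core_2.11-2.4.3.jar",
    "spark-network-common.jar": "spark-network-common_2.11-2.4.3.jar",
    "spark-unsafe.jar": "spark-unsafe_2.11-2.4.3.jar",
}

def link_spark_jars(spark_jar_dir, hive_lib_dir, jars=SPARK_JARS):
    """Create the symlinks Hive needs to launch the Spark RemoteDriver.

    Skips links that already exist, so the script is safe to re-run.
    Returns the list of link names it created on this invocation.
    """
    created = []
    for link_name, jar_name in jars.items():
        target = os.path.join(spark_jar_dir, jar_name)
        link = os.path.join(hive_lib_dir, link_name)
        if not os.path.exists(link):
            os.symlink(target, link)
            created.append(link_name)
    return created
```

On the cluster above you would call it as link_spark_jars("/usr/lib/spark/jars", "/usr/lib/hive/lib").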
<property>
<name>spark.yarn.jars</name>
<value>hdfs://xxxx:8020/spark-jars/*</value>
</property>
The manual is missing one important piece of information: you need to exclude the default Hive 1.2.1 jars. This is what I did:
hadoop fs -mkdir /spark-jars
hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/
hadoop fs -rm /spark-jars/*hive*1.2.1*
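The exclusion step boils down to a glob filter over the jar names. A small sketch (the helper and the jar listing are hypothetical, for illustration only) shows which files the *hive*1.2.1* pattern drops:

```python
from fnmatch import fnmatch

def jars_to_keep(jar_names, exclude_pattern="*hive*1.2.1*"):
    """Drop the Hive 1.2.1 jars bundled with Spark, keep everything else.

    Mirrors the effect of: hadoop fs -rm /spark-jars/*hive*1.2.1*
    """
    return [j for j in jar_names if not fnmatch(j, exclude_pattern)]

# Hypothetical jar listing for illustration:
jars = [
    "spark-core_2.11-2.4.3.jar",
    "hive-exec-1.2.1.spark2.jar",
    "hive-metastore-1.2.1.spark2.jar",
    "spark-sql_2.11-2.4.3.jar",
]
print(jars_to_keep(jars))  # only the two spark-* jars survive
```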
Also, you need to add the following to your spark-defaults.conf file:
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
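Note that spark-defaults.conf is plain whitespace-separated key/value lines (no semicolons or equals signs). A tiny parser sketch (my own helper, not a Spark API) illustrates the expected format using the two settings above:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style lines: 'key value', '#' comments ignored."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Key is everything before the first space; the rest is the value.
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf

sample = """
# added for Hive-on-Spark
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
"""
conf = parse_spark_defaults(sample)
print(conf["spark.sql.hive.metastore.version"])  # → 2.3.0
```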
For more information on interacting with different versions of the Hive metastore, check this link.
Answer 1 (score: 0)
It turns out that Hive-on-Spark has many implementation problems and essentially does not work at all unless you write your own custom Hive connector. In a nutshell, the Spark developers are struggling to keep up with Hive releases, and they have not yet decided how to handle backward compatibility for older Hive versions (~< 2) while focusing on the newest branch.
1) Go back to Hive 1.x
Not ideal, especially if you want a more modern integration with file formats such as ORC.
2) Use Hive-on-Tez
This is the one we decided to adopt. This solution does not break the open source stack, and it works perfectly together with Spark-on-YARN. Third-party Hadoop ecosystems, such as those of Azure, AWS, and Hortonworks, all add proprietary code just for running Hive-on-Spark, because of the mess it became.
By installing Tez, your Hadoop queries will work as follows: a direct Hive query (from the hive console) runs in a Tez container on the cluster, while a Spark job, i.e.
SparkSession.builder.enableHiveSupport().getOrCreate()
(this is pyspark code), uses Spark containers on the cluster.
Note: since I do not have much interest in these boards, I will keep it short. Ask for details and I will be happy to help and expand.
Version matrix
Hadoop 2.9.2
Tez 0.9.2
Hive 2.3.4
Spark 2.4.2
Hadoop is installed in cluster mode.
This is what worked for us. I would not expect it to work seamlessly when switching to Hadoop 3.x (which we will do at some point in the future), but it should work fine as long as you do not change the major release version of each component.
Basic guide

- Install Tez and test your Hive installation with a simple query such as
select count(*) from myDb.myTable
You should see a Tez bar in the Hive console.
- Recompile Spark after removing the line
ConfVars.HIVE_STATS_JDBC_TIMEOUT -> TimeUnit.SECONDS,
from
./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
(this ConfVar no longer exists in Hive 2.x).
- Share $HIVE_HOME/conf/hive-site.xml in the $SPARK_HOME/conf/ directory. You must make a hard copy of this config file, not a symlink. The reason is that, as stated above, you must delete all Tez-related Hive configuration values from Spark's copy to make sure Spark co-exists independently with Tez. This includes the hive.execution.engine=tez property, which must be removed completely from Spark's hive-site.xml while being kept in Hive's hive-site.xml.
- Set the property mapreduce.framework.name=yarn in $HADOOP_HOME/etc/hadoop/mapred-site.xml. Even though it is not set to yarn-tez, both environments pick it up correctly. This only means that raw mapreduce jobs will not run on Tez while Hive jobs will use it, which is a problem only for legacy jobs, since raw mapred is obsolete. Good luck!
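The hard-copy-minus-Tez-values step can be sketched in a few lines. This is my own illustrative helper (not part of Hive or Spark), applied to a toy hive-site.xml; in practice you would feed it the real file and the full set of Tez-related keys:

```python
import xml.etree.ElementTree as ET

def strip_properties(hive_site_xml, keys_to_drop):
    """Return a hive-site.xml document with the given properties removed.

    Intended to produce Spark's copy of hive-site.xml without
    hive.execution.engine=tez and other Tez-related values.
    """
    root = ET.fromstring(hive_site_xml)
    for prop in list(root.findall("property")):
        if prop.findtext("name") in keys_to_drop:
            root.remove(prop)
    return ET.tostring(root, encoding="unicode")

# Toy configuration for illustration:
hive_site = """<configuration>
  <property><name>hive.execution.engine</name><value>tez</value></property>
  <property><name>hive.metastore.uris</name><value>thrift://host:9083</value></property>
</configuration>"""

spark_copy = strip_properties(hive_site, {"hive.execution.engine"})
```

Hive keeps the original file (with the Tez engine setting); only Spark's copy is stripped.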