AWS EMR 5.11.0 - Apache Hive on Spark

Date: 2018-01-28 22:23:35

Tags: apache-spark hive amazon-emr

I am trying to set up Apache Hive on Spark on AWS EMR 5.11.0 (Apache Spark 2.2.1, Apache Hive 2.3.2). The YARN logs show the following error:

18/01/28 21:55:28 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)

hiveserver2.log:
2018-01-28T21:56:50,109 ERROR [HiveServer2-Background-Pool: Thread-68([])]: client.SparkClientImpl (SparkClientImpl.java:<init>(112)) - Timed out waiting for client connection. Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc. Please check YARN or Spark driver's logs for further information.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
    at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41) ~[netty-all-4.0.52.Final.jar:4.0.52.Final]
    at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:109) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
    at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]

Also:
2018-01-28T21:56:50,110 ERROR [HiveServer2-Background-Pool: Thread-68([])]: spark.SparkTask (SessionState.java:printError(1126)) - Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64)
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
    at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
    at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:103)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.

Can someone point out what I might be missing in the configuration?

4 Answers:

Answer 0 (score: 1):

Sorry, but Hive on Spark is not yet supported on EMR. I haven't tried it myself, but I think the likely cause of your errors is a mismatch between the Spark version supported on EMR and the Spark version that Hive depends on. The last time I checked, Hive did not support Spark 2.x for Hive on Spark. Given that your first error is a NoSuchFieldError, a version mismatch looks like the most probable cause. The timeout error is likely a red herring.
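To see that underlying failure rather than the client-side timeout, it can help to pull the full YARN application log; a sketch (the application ID below is a placeholder, substitute the one from your ResourceManager UI):

# Fetch the aggregated YARN log for the failed Hive-on-Spark driver and show
# the lines around the real root cause (application ID is hypothetical):
yarn logs -applicationId application_1517173620085_0001 | grep -B2 -A8 NoSuchFieldError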

Answer 1 (score: 0):

EMR's Spark supports Hive version 1.2.1, not Hive 2.x. Could you check the version of the Hive jars shipped in the /usr/lib/spark/jars/ directory? SPARK_RPC_SERVER_ADDRESS was added in Hive 2.x.
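A quick way to check, assuming EMR's default install layout:

# List the Hive artifacts bundled with Spark; versions ending in 1.2.1 would
# confirm that Spark was built against Hive 1.x, which lacks the
# SPARK_RPC_SERVER_ADDRESS field that Hive 2.3.2's RPC client expects.
ls /usr/lib/spark/jars/ | grep -i hive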

Answer 2 (score: 0):

Declare the Spark dependencies in your sbt build (or the pom.xml equivalents) as "provided", like this:

val sparkVersion = "2.2.1"  // match the Spark version installed on the EMR cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"       % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive"      % sparkVersion % "provided"
)

Marking them "provided" means the application runs against the Spark jars already on the cluster instead of bundling its own copies.

I run a data warehouse (Hive) on EMR, and a Spark application stores data into the DWH.

Answer 3 (score: 0):

I was able to run Hive like this:

HIVE_AUX_JARS_PATH=$(find /usr/lib/spark/jars/ -name '*.jar' -and -not -name '*slf4j-log4j12*' -printf '%p:' | head -c-1) hive

Then, before issuing any other SQL queries:

SET hive.execution.engine=spark;

To make this persistent, add the line

export HIVE_AUX_JARS_PATH=$(find /usr/lib/spark/jars/ -name '*.jar' -and -not -name '*slf4j-log4j12*' -printf '%p:' | head -c-1)

to /home/hadoop/.bashrc.
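A quick sanity check (not part of the original recipe) that the variable expands to Spark's jars:

# Reload the shell profile, then print the first few classpath entries;
# each should be a jar under /usr/lib/spark/jars/.
source /home/hadoop/.bashrc
echo "$HIVE_AUX_JARS_PATH" | tr ':' '\n' | head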

And in /etc/hive/conf/hive-site.xml, the following was already set:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
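For a new cluster, the same property can also be applied at creation time through an EMR configuration classification; a hedged sketch (the instance settings are placeholders, not from the original answer):

# Create an EMR 5.11.0 cluster with Hive and Spark, pre-setting
# hive.execution.engine=spark via the hive-site classification.
aws emr create-cluster \
  --release-label emr-5.11.0 \
  --applications Name=Hive Name=Spark \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.execution.engine":"spark"}}]' \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles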