Failed to create Spark client - Spark as the execution engine with Hive

Time: 2019-07-02 13:44:51

Tags: amazon-web-services apache-spark hadoop hive amazon-emr

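I have a single-node Amazon EMR cluster with 32 GB of memory, with Hive 2.3.4, Spark 2.4.2 and Hadoop 2.8.5 installed.

I am trying to configure Spark as the execution engine for Hive.

I linked the Spark jar files into Hive with the following commands: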

sudo ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.2.jar
sudo ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.2.jar
sudo ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar
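The commands above give only the link source, so each link is created in whatever directory they are run from. A rough sketch that makes the destination explicit could look like the following. The function name `link_spark_jars` is hypothetical, and the Hive lib directory is passed in as a parameter because its location is not stated in this question.

```shell
#!/usr/bin/env bash
# Sketch: link the three Spark jars named above into a Hive lib directory.
# link_spark_jars is a hypothetical helper; both directories are parameters.
link_spark_jars() {
  local spark_jars="$1" hive_lib="$2" jar status=0
  for jar in \
    spark-core_2.11-2.4.2.jar \
    spark-network-common_2.11-2.4.2.jar \
    scala-library-2.11.12.jar
  do
    if [ -f "${spark_jars}/${jar}" ]; then
      # -f replaces an existing link so the function can be re-run safely
      ln -sf "${spark_jars}/${jar}" "${hive_lib}/${jar}"
      echo "linked ${jar}"
    else
      echo "missing: ${spark_jars}/${jar}" >&2
      status=1
    fi
  done
  return "$status"
}
```

It would be called as `link_spark_jars /usr/lib/spark/jars <hive lib dir>` from a shell with write access to the destination. A nonzero return means at least one of the jars was not found under the source directory.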

I have also set the execution engine in hive-site.xml. I added the following to the hive-site.xml file in /etc/hive/conf/:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>

<property>
   <name>spark.master</name>
   <value>spark://<EMR hostname>:7077</value>
 </property>
<property>
   <name>spark.eventLog.enabled</name>
   <value>true</value>
 </property>
<property>
   <name>spark.eventLog.dir</name>
   <value>/tmp</value>
 </property>
<property>
   <name>spark.serializer</name>
   <value>org.apache.spark.serializer.KryoSerializer</value>
 </property>
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://<EMR hostname>:8020/spark-jars/*</value>
</property>
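Because Hive reads hive-site.xml as XML, it seems worth ruling out an unbalanced `<property>` tag in the file before looking further. The sketch below is a rough text-level check, not a real XML parser, and the function name `check_property_balance` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of opening and closing <property> tags in a
# Hadoop-style XML config file. This only counts tags; it does not parse XML.
check_property_balance() {
  local file="$1" opens closes
  # grep -o prints each match on its own line, so wc -l counts occurrences
  opens=$(grep -o '<property>' "$file" | wc -l)
  closes=$(grep -o '</property>' "$file" | wc -l)
  # arithmetic expansion strips any padding that wc adds on some systems
  opens=$(( opens ))
  closes=$(( closes ))
  echo "open=${opens} close=${closes}"
  [ "$opens" -eq "$closes" ]
}
```

Running `check_property_balance /etc/hive/conf/hive-site.xml` prints both counts and returns nonzero when they differ.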

In addition, I have copied all the jars from Spark into an HDFS folder named spark-jars.

When I run a Hive query, I get the following error:

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.

I also checked the Hive log, and it only gives me the following:

2019-07-02T13:33:23,831 ERROR [f7d8916c-25f1-4d90-8919-07c4b3422b35 main([])]: ql.Driver (SessionState.java:printError(1126)) - FAILED: Semanti$
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to c$
        at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
        at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
        at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
        at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
        at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11293)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
        at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
        at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)

I am running the following hql file:

set hive.spark.client.server.connect.timeout=300000ms;
set spark.executor.memory=4915m;
set spark.executor.cores=2;
set spark.yarn.executor.memoryOverhead=1229m;
set spark.executor.instances=2;
set spark.driver.memory=4096m;
set spark.yarn.driver.memoryOverhead=400m;

select column_name from table_name group by column_name;
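For reference, the memory these settings ask for can be added up as below. This is only a sketch of the arithmetic: it assumes each executor and the driver are sized as heap plus overhead, and it ignores any rounding the resource manager may apply to container sizes.

```shell
#!/usr/bin/env bash
# Values copied from the set statements in the hql file above, in MiB.
executor_memory=4915
executor_overhead=1229
executor_instances=2
driver_memory=4096
driver_overhead=400

# Assumption: each process needs its heap plus its overhead.
per_executor=$(( executor_memory + executor_overhead ))
executors_total=$(( per_executor * executor_instances ))
driver_total=$(( driver_memory + driver_overhead ))
requested_total=$(( executors_total + driver_total ))

echo "per executor:    ${per_executor} MiB"
echo "all executors:   ${executors_total} MiB"
echo "driver:          ${driver_total} MiB"
echo "total requested: ${requested_total} MiB"
```

Under those assumptions the request comes to 6144 MiB per executor, 12288 MiB for both executors, 4496 MiB for the driver, and 16784 MiB in total, which is below the 32 GB of the node.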

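Please let me know if you need to see any other configuration files...

Is this error caused by a version incompatibility? Or is it simply not possible to use Spark as the execution engine for Hive on Amazon EMR?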

0 answers:

No answers