Can Spark 2.4.2 be used as the execution engine for Hive 2.3.4 on Amazon EMR?
I linked the jar files (scala-library, spark-core, spark-network-common) into Hive with the following commands:
cd $HIVE_HOME/lib
ln -s $SPARK_HOME/jars/spark-network-common_2.11-2.4.2.jar
ln -s $SPARK_HOME/jars/spark-core_2.11-2.4.2.jar
ln -s $SPARK_HOME/jars/scala-library-2.11.12.jar
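A quick way to confirm the links resolve and are not dangling (same paths as above):

ls -l $HIVE_HOME/lib/spark-network-common_2.11-2.4.2.jar
ls -l $HIVE_HOME/lib/spark-core_2.11-2.4.2.jar
ls -l $HIVE_HOME/lib/scala-library-2.11.12.jar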
I added the following settings to hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>Use Spark as the execution engine</description>
</property>
<property>
<name>spark.master</name>
<value>spark://<EMR hostname>:7077</value>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>/tmp</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
<name>spark.yarn.jars</name>
<value>hdfs://<EMR hostname>:54310/spark-jars/*</value>
</property>
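For spark.yarn.jars to resolve, the HDFS directory it points to has to exist and contain the Spark jars. A typical way to populate it (hostname and port as in the config above) is:

hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark-jars/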
Spark is up and running, and I can also query Hive from pyspark. However, when I try to use Spark as Hive's execution engine with the configuration above, it throws the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable
at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11293)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: scala.collection.Iterable
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 33 more
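For what it's worth, the error appears as soon as Hive compiles any query that needs a Spark job (the table name below is just an example):

hive> set hive.execution.engine=spark;
hive> select count(*) from test_table;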
Is this a configuration error or a version-incompatibility problem?
Hive also works perfectly with Tez...
Answer 0 (score: 1)
This clearly points to a mismatch in the Scala jars on Hive's classpath: the Scala version Hive picks up is incompatible with the one Spark was built against.
Tez does not use Spark or Scala, which is why it works fine. Spark is written in Scala but cannot find the right Scala version on the classpath, and that is the cause of
java.lang.NoClassDefFoundError: scala/collection/Iterable
This is a very common problem when using Spark as the execution engine for Hive...
Steps:
1) Go to $HIVE_HOME/bin/hive.
2) Take a backup of the file before editing $HIVE_HOME/bin/hive.
3) Find the CLASSPATH variable; the script first adds all the Hive libraries:
CLASSPATH=${CLASSPATH}:${HIVE_LIB}/*.jar
for f in ${HIVE_LIB}/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
Then add the Spark libraries to the Hive classpath, right below the CLASSPATH variable that already holds all the Hive libraries:
for f in ${SPARK_HOME}/jars/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
Now the same CLASSPATH variable holds both the Hive jars and the Spark jars. The Spark jars ship with the Scala library that matches Spark, so there is no version-compatibility problem.
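After the edit, the relevant part of $HIVE_HOME/bin/hive would look roughly like this (a sketch; the exact surrounding lines vary by Hive version):

# existing loop that adds the Hive libraries
for f in ${HIVE_LIB}/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# added: append the Spark jars, including their matching scala-library
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done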
4) Now change the Hive execution engine so that it points to Spark in hive-site.xml, as you already know:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>Use Spark as the execution engine</description>
</property>
Link jar files: now we create soft links to certain Spark jar files so that Hive can find them:
ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-network-common_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-network-common_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-core_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-core_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/scala-library-2.11.8.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/scala-library-2.11.8.jar
Conclusion: in any case, you need to make sure the correct Scala jars are on Hive's classpath when Spark is used as the execution engine...
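Once everything is in place, restart the Hive CLI and run a quick smoke test (the table name is only an example; "set hive.execution.engine;" echoes the current value):

hive> set hive.execution.engine;
hive.execution.engine=spark
hive> select count(*) from some_table;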