On Amazon EMR 4.0.0, setting /etc/spark/conf/spark-env.conf has no effect

Date: 2015-09-29 22:20:42

Tags: amazon-web-services apache-spark apache-spark-sql emr

I am launching a Spark-based hiveserver2 on Amazon EMR, and it has extra classpath dependencies. Because of this bug in Amazon EMR:

https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/

my classpath cannot be passed through the "--driver-class-path" option.

So I had to modify /etc/spark/conf/spark-env.conf to add the extra classpath:

# Add Hadoop libraries to Spark classpath
SPARK_CLASSPATH="${SPARK_CLASSPATH}:${HADOOP_HOME}/*:${HADOOP_HOME}/../hadoop-hdfs/*:${HADOOP_HOME}/../hadoop-mapreduce/*:${HADOOP_HOME}/../hadoop-yarn/*:/home/hadoop/git/datapassport/*"

where "/home/hadoop/git/datapassport/*" is my classpath.

However, after the server starts successfully, the Spark environment parameters show that my change had no effect:

spark.driver.extraClassPath :/usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*

Is this configuration file obsolete? Where is the new file, and how can I solve this problem?

2 Answers:

Answer 0 (score: 2):

Have you tried setting spark.driver.extraClassPath in the spark-defaults configuration classification? Something like this:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "${SPARK_CLASSPATH}:${HADOOP_HOME}/*:${HADOOP_HOME}/../hadoop-hdfs/*:${HADOOP_HOME}/../hadoop-mapreduce/*:${HADOOP_HOME}/../hadoop-yarn/*:/home/hadoop/git/datapassport/*"
    }
  }
]
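To apply such a classification at cluster-creation time, it can be passed to the AWS CLI via the --configurations flag. Below is a minimal sketch; the cluster options are placeholders, and the classpath value is reduced to just the custom entry, since shell-style variables like ${HADOOP_HOME} in the JSON above would most likely be written literally into spark-defaults.conf rather than expanded:

```shell
# Sketch, assuming the AWS CLI is installed and configured.
# Write the spark-defaults classification to a file:
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/home/hadoop/git/datapassport/*"
    }
  }
]
EOF

# Then pass it at cluster creation (other options elided):
# aws emr create-cluster --release-label emr-4.0.0 \
#     --applications Name=Spark \
#     --configurations file://spark-config.json \
#     ...

# Quick sanity check that the JSON is well-formed:
python -c "import json; print(json.load(open('spark-config.json'))[0]['Classification'])"
```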

Answer 1 (score: 2):

You can use --driver-class-path.

Start spark-shell on the master node of a fresh EMR cluster:

spark-shell --master yarn-client
scala> sc.getConf.get("spark.driver.extraClassPath")
res0: String = /etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*

Add your JAR file to the EMR cluster with a --bootstrap-action.

When you call spark-submit, prepend (or append) your JAR file to the value of extraClassPath you got from spark-shell:

spark-submit --master yarn-cluster --driver-class-path /home/hadoop/my-custom-jar.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
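The prepend step can be sketched in shell. The default classpath below is shortened and illustrative; on a real cluster you would use the full value returned by the spark-shell query shown earlier:

```shell
# Illustrative values only; on a live cluster, read the real default from
# spark-shell via sc.getConf.get("spark.driver.extraClassPath").
DEFAULT_CP="/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*"
MY_JAR="/home/hadoop/my-custom-jar.jar"

# Prepend the custom JAR so its classes take precedence over the defaults.
FULL_CP="${MY_JAR}:${DEFAULT_CP}"
echo "$FULL_CP"

# Then pass the combined string to spark-submit, e.g.:
# spark-submit --master yarn-cluster --driver-class-path "$FULL_CP" app.jar
```

Prepending rather than appending matters when your JAR must shadow a class that also ships with the cluster's default libraries.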

This worked for me on EMR releases 4.1 and 4.2.

The process that builds spark.driver.extraClassPath may change between releases, which is probably why SPARK_CLASSPATH no longer works.