How do I get a custom log4j.properties to take effect for Spark drivers and executors on an AWS EMR cluster?

Asked: 2021-04-12 05:38:41

Tags: amazon-web-services apache-spark log4j amazon-emr

I have an AWS CLI cluster-creation command that I am trying to modify so that my driver and executors pick up a custom log4j.properties file. With standalone Spark clusters I have successfully used the approach of passing the file via the --files switch, together with -Dlog4j.configuration= set through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
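For reference, here is a minimal sketch of the kind of spark-submit invocation that worked for me on a standalone cluster (the class name and paths are placeholders, not my real values):

spark-submit \
  --class com.acme.SparkFoo \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  /path/to/spark.jar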

I have tried many different permutations and variations, but I have not yet gotten this to work for Spark jobs running on an AWS EMR cluster.

I use the AWS CLI create-cluster command with an intermediate step that downloads my Spark jar and unzips it to extract the log4j.properties packaged inside that .jar. I then copy the log4j.properties into my HDFS /tmp folder and try to distribute that file via "--files".

Note that I have also tried this without HDFS (specifying --files log4j.properties instead of --files hdfs:///tmp/log4j.properties), and that did not work either.

My latest non-working version of this command (the HDFS variant) is below. I am wondering if anyone can share a recipe that actually works. When I run this version, the driver's command output is:

log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader@1e67b872.
log4j: Using URL [file:/etc/spark/conf.dist/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/etc/spark/conf.dist/log4j.properties
log4j: Parsing for [root] with value=[WARN,stdout].

From the above I can see that my log4j.properties file is not being picked up (the default one is). Besides -Dlog4j.configuration=log4j.properties, I have also tried configuring -Dlog4j.configuration=classpath:log4j.properties (which failed as well).
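One more variant I am considering (an assumption on my part, not something I have verified on EMR): since --files is supposed to ship the file into each YARN container's working directory, pointing log4j at an explicit file: URI relative to that directory might resolve it:

-Dlog4j.configuration=file:log4j.properties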

Any guidance would be much appreciated!

AWS command

jarPath=s3://com-acme/deployments/spark.jar
class=com.acme.SparkFoo


log4jConfigExtractCmd="aws s3 cp $jarPath /tmp/spark.jar ; cd /home/hadoop ; unzip /tmp/spark.jar log4j.properties ;  hdfs dfs -put log4j.properties /tmp/log4j.properties  "
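For readability, the same step expanded into separate lines with comments (identical commands, just annotated):

# download the application jar from S3
aws s3 cp $jarPath /tmp/spark.jar
cd /home/hadoop
# extract only log4j.properties from the jar
unzip /tmp/spark.jar log4j.properties
# copy it into HDFS so that spark-submit --files can distribute it
hdfs dfs -put log4j.properties /tmp/log4j.properties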


aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark \
--tags 'Project=mouse' \
      'Owner=SwarmAnalytics'\
       'DatadogMonitoring=True'\
       'StreamMonitorRedshift=False'\
       'DeployRedshiftLoader=False'\
       'Environment=dev'\
       'DeploySpark=False'\
       'StreamMonitorS3=False'\
       'Name=CCPASixCore' \
--ec2-attributes '{"KeyName":"mouse-spark-2021","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-07039960","EmrManagedSlaveSecurityGroup":"sg-09c806ca38fd32353","EmrManagedMasterSecurityGroup":"sg-092288bbc8812371a"}' \
--release-label emr-5.27.0 \
--log-uri 's3n://log-foo' \
--steps '[{"Args":["bash","-c", "$log4jConfigExtractCmd"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"downloadSparkJar"},{"Args":["spark-submit","--files", "hdfs:///tmp/log4j.properties","--deploy-mode","client","--class","$class","--driver-memory","24G","--conf","spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.yarn.executor.memoryOverhead=10g","--conf","spark.yarn.driver.memoryOverhead=10g","$jarPath"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"SparkFoo"}]'\
 --instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"r5d.4xlarge","Name":"Core - 6"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"}]' \
--configurations '[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR","log4j.logger.org.apache.hadoop":"ERROR","log4j.appender.stdout":"org.apache.log4j.ConsoleAppender","log4j.logger.io.netty":"ERROR","log4j.logger.org.apache.spark.scheduler.cluster":"ERROR","log4j.rootLogger":"WARN,stdout","log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n","log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO"}},{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'\
 --auto-terminate --ebs-root-volume-size 10 --service-role EMR_DefaultRole \
--security-configuration 'CCPA_dev_security_configuration_2' --enable-debugging --name 'SparkFoo' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1 --profile sandbox

1 Answer:

Answer 0 (score: 2)

Here is how to change the logging. The best approach on AWS/EMR (that I have found) is not to fiddle with

spark.driver.extraJavaOptions  or 
spark.executor.extraJavaOptions

Instead, take advantage of a configuration block like the one below:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",

Now, suppose you want to change all logging done by classes under com.foo and its descendants to TRACE. You would change the block above to look like this:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"TRACE","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",