Spark file logger in YARN mode

Time: 2018-01-15 13:18:08

Tags: apache-spark log4j yarn

I want to create a custom logger that writes messages from the executors to a specific folder on the cluster nodes. I edited my log4j.properties file in SPARK_HOME/conf/ as follows:

log4j.rootLogger=${root.logger}
root.logger=WARN,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}

# My logger to write useful messages to a local file
log4j.logger.jobLogger=INFO, RollingAppenderU

log4j.appender.RollingAppenderU=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppenderU.File=/var/log/sparkU.log
log4j.appender.RollingAppenderU.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppenderU.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppenderU.layout.ConversionPattern=[%p] %d %c %M - %m%n
log4j.appender.fileAppender.MaxFileSize=1MB
log4j.appender.fileAppender.MaxBackupIndex=1

I want jobLogger to write its output to /var/log/sparkU.log. I wrote a small Python program that logs a few specific messages:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName("test custom logging") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

log4jLogger = spark.sparkContext._jvm.org.apache.log4j 
log = log4jLogger.LogManager.getLogger("jobLogger") 

log.info("Info message")
log.warn("Warn message")
log.error("Error message")

I submit it like this:

/usr/bin/spark-submit --master yarn --deploy-mode client /mypath/test_log.py

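When I use deploy mode client, the file is written to the desired location. When I use deploy mode cluster, the local file is not written, but the messages can be found in the YARN logs. In both modes, however, the YARN logs also contain the following error (output taken from the YARN logs of a cluster-mode run):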

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /var/log/sparkU.log (Permission denied)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:142)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:223)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
    at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:117)
    at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:102)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:746)
    at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:746)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:761)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
log4j:ERROR Either File or DatePattern options are not set for appender [RollingAppenderU].
18/01/15 12:13:00 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
18/01/15 12:13:02 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
18/01/15 12:13:04 INFO jobLogger: Info message
18/01/15 12:13:04 WARN jobLogger: Warn message
18/01/15 12:13:04 ERROR jobLogger: Error message

So I have two questions:

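- Why is the first error message (java.io.FileNotFoundException) printed? I suspect it comes from the logger of the application master, but how can I stop it from printing this error? I only want the executors to use the file logger.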

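- Is it possible to use cluster mode and still write to a specific file on one of the machines? I wonder whether I could somehow give a path like host:port/myPath/spark.log so that all executors write to that file on a single machine.

Thanks in advance for your replies.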

1 answer:

Answer 0 (score: 1)

I was able to append to a local file using a custom logger with YARN in cluster mode.

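First, I made the log4j file available in the same directory on all cluster worker nodes (e.g. /home/myUser/log4j.custom.properties), and on those same nodes I created a folder under my user path to hold the logs (e.g. /home/myUser/sparkLogs).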
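The preparation step can be sketched in Python. This is only an illustration, not the answerer's actual setup: the helper names `ensure_writable_dir` and `render_job_logger_config` are hypothetical, and the sketch assumes the appender's File option should point into the user-owned log folder rather than /var/log, which is what the "Permission denied" error in the question suggests. It also leaves out the MaxFileSize and MaxBackupIndex lines, since they refer to an appender named fileAppender that is never defined, and log4j 1.x's DailyRollingFileAppender has no such options (they belong to RollingFileAppender).

```python
import os
import posixpath


def ensure_writable_dir(path):
    """Create `path` if it is missing and report whether the current user can write to it."""
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.W_OK)


def render_job_logger_config(log_dir, file_name="sparkU.log"):
    """Return the jobLogger section of a log4j 1.x properties file.

    `log_dir` is assumed to be a directory the YARN container user can
    write to, e.g. /home/myUser/sparkLogs (hypothetical example path).
    """
    log_file = posixpath.join(log_dir, file_name)
    lines = [
        "log4j.logger.jobLogger=INFO, RollingAppenderU",
        "log4j.appender.RollingAppenderU=org.apache.log4j.DailyRollingFileAppender",
        "log4j.appender.RollingAppenderU.File=" + log_file,
        "log4j.appender.RollingAppenderU.DatePattern='.'yyyy-MM-dd",
        "log4j.appender.RollingAppenderU.layout=org.apache.log4j.PatternLayout",
        "log4j.appender.RollingAppenderU.layout.ConversionPattern=[%p] %d %c %M - %m%n",
    ]
    return "\n".join(lines) + "\n"
```

Running `ensure_writable_dir` on each node under the account that owns the YARN containers is a quick way to confirm that the appender will be able to open its file before submitting the job.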

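Then, when submitting, I passed that file as the driver's logger configuration with --driver-java-options, and that did the trick. I used this submit command (the log4j file is the same as before):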

/usr/bin/spark2-submit \
  --driver-java-options "-Dlog4j.configuration=file:///home/myUser/log4j.custom.properties" \
  --master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
  --executor-cores n /home/myUser/sparkScripts/myCode.py
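For scripted submissions, the same command can be assembled as an argument list. The sketch below is a hedged illustration: `build_submit_command` is a hypothetical helper, and the `spark.executor.extraJavaOptions` entry is an assumption that is not part of the command above. It is included because the question asks for executors to use the file logger, and it relies on the log4j file existing at the same path on every node, as described in the first step. The memory and core options are omitted because their values are placeholders in the original command.

```python
def build_submit_command(script, log4j_file, deploy_mode="client",
                         submit_bin="/usr/bin/spark2-submit"):
    """Build a spark-submit argument list that points log4j at a custom file.

    `log4j_file` must be an absolute path that exists on the nodes where
    the driver (and, under the stated assumption, the executors) run.
    """
    log4j_opt = "-Dlog4j.configuration=file://" + log4j_file
    return [
        submit_bin,
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        # Taken from the command above: the driver reads the custom log4j file.
        "--driver-java-options", log4j_opt,
        # Assumption, not in the original command: give executors the same file.
        "--conf", "spark.executor.extraJavaOptions=" + log4j_opt,
        script,
    ]


cmd = build_submit_command(
    "/home/myUser/sparkScripts/myCode.py",
    "/home/myUser/log4j.custom.properties",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list instead of a single string avoids shell-quoting problems with the `-D` option.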