How to inspect task logs in Spark local mode

Asked: 2019-02-15 07:13:04

Tags: r apache-spark logging sparklyr

I am using a local Spark instance via the sparklyr R package on a machine with 64 GB of RAM and 40 cores.

I have to process many thousands of text files and parse the email addresses they contain. The goal is a data frame with columns such as user name, top-level domain, domain and subdomain, which is then saved as Parquet files. I split the files into batches of 2.5 GB and process each batch separately.
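
For context, a single batch roughly follows the shape below. This is only a minimal sketch: the paths and the email regex are made up for illustration, the further splitting into top-level domain and subdomain is omitted, and sc is the Spark connection set up further down:

library(sparklyr)
library(dplyr)

# read one 2.5 GB batch of raw text files into Spark
emails_raw <- spark_read_text(sc, name = "emails_raw",
                              path = "file:///data/emails/batch_001/")

# extract one email address per line via Spark SQL's regexp_extract
# (assumes spark_read_text's text column is called "line"; regex simplified)
emails <- emails_raw %>%
  mutate(email = regexp_extract(line, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+", 0)) %>%
  filter(email != "")

# split into user name / domain and persist the batch as Parquet
emails %>%
  ft_regex_tokenizer(input_col = "email", output_col = "email_split",
                     pattern = "@", to_lower_case = FALSE) %>%
  sdf_separate_column(column = "email_split",
                      into = c("email_user", "email_domain")) %>%
  select(-email_split) %>%
  spark_write_parquet(path = "file:///data/parquet/batch_001")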

Most batches work fine, but occasionally a task fails and the whole batch is "lost". In such a case this is the log output:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 2260.0 failed 1 times, most recent failure: Lost task 47.0 in stage 2260.0 (TID 112228, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array<string>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.project_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:57)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
    at java.util.regex.Matcher.reset(Matcher.java:309)
    at java.util.regex.Matcher.<init>(Matcher.java:229)
    at java.util.regex.Pattern.matcher(Pattern.java:1093)
    at java.util.regex.Pattern.split(Pattern.java:1206)
    at java.util.regex.Pattern.split(Pattern.java:1273)
    at scala.util.matching.Regex.split(Regex.scala:526)
    at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:144)
    at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:141)
    ... 22 more

I make heavy use of ft_regex_tokenizer, for example here to split the email addresses into user name and domain:

spark_tbls_separated %<>%
  ft_regex_tokenizer(input_col = "email",
                     output_col = "email_split",
                     pattern = "@",
                     to_lower_case = FALSE) %>%
  sdf_separate_column(column = "email_split",
                      into = c("email_user", "email_domain")) %>%
  select(-email_split, -email)
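
My best guess so far is that a NULL email value reaching Regex.split is what triggers the NullPointerException above, in which case a filter in front of the tokenizer would merely mask the problem rather than explain it (untested sketch):

spark_tbls_separated %>%
  filter(!is.na(email)) %>%   # drop rows where email is NULL
  ft_regex_tokenizer(input_col = "email",
                     output_col = "email_split",
                     pattern = "@",
                     to_lower_case = FALSE)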

So now I would like to know exactly which Spark transformation caused the error, and on what kind of input data, so that I can actually debug its cause. I guess the only way to figure that out is to look at the task logs (do they even exist?). Ideally I would look at the logs of task 47 and find more detailed log output there. How can I access them on my local machine?
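
For reference, the only log entry points I know of from sparklyr are the driver log and the Spark UI, and neither obviously gives me per-task output:

# tail the driver log captured by sparklyr (in local mode the executor runs
# inside the driver JVM, so its output should land here as well)
spark_log(sc, n = 200)

# the same log, filtered for the failing task
spark_log(sc, n = 10000, filter = "task 47")

# open the Spark UI (Jobs / Stages / Executors tabs) in the browser
spark_web(sc)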

These are my config options, e.g. the history server is ready to run:

spark_config <- spark_config()
spark_config$`sparklyr.shell.driver-memory` <- "64G"
spark_config$spark.memory.fraction <- 0.75
spark_config$spark.speculation <- TRUE
spark_config$spark.speculation.multiplier <- 2
spark_config$spark.speculation.quantile <- 0.5
spark_config$sparklyr.backend.timeout <- 3600 * 2 # two-hour timeout
spark_config$spark.eventLog.enabled <- TRUE
spark_config$spark.eventLog.dir <- "file:///tmp/spark-events"
spark_config$spark.history.fs.logDirectory  <- "file:///tmp/spark-events"

sc <- spark_connect(master = "local", config = spark_config)
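
To check that event logging actually takes effect, I simply look at the log directory from R after a run; each application writes one file there, and the history server (default http://localhost:18080) reads the same directory:

# one event log file per Spark application, named after its app ID
list.files("/tmp/spark-events")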

Note that this question is not about the actual error shown here, but about the possibility of inspecting task logs in order to find out on which line of my sparklyr script the failure occurs.

0 Answers:

There are no answers yet.