I am using a local Spark instance via the sparklyr R package on a machine with 64 GB of RAM and 40 cores.
I have to process many thousands of text files and parse the email addresses they contain. The goal is a data frame with columns such as user name, top-level domain, domain, and subdomain, which is then saved as a Parquet file. I split the files into batches of roughly 2.5 GB and process each batch separately.
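For context, a single batch roughly looks like the sketch below (the paths, table name, and the email regex are simplified placeholders, not my exact code):

library(sparklyr)
library(dplyr)

batch <- spark_read_text(sc, name = "batch", path = "/data/batches/batch_001/*.txt")

emails <- batch %>%
  # spark_read_text exposes a single string column called "line";
  # pull out the first email-looking substring per line (placeholder regex)
  mutate(email = regexp_extract(line, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+", 0L)) %>%
  filter(email != "")

spark_write_parquet(emails, path = "/data/parquet/batch_001")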
Most batches work fine; however, sometimes a task fails and the whole batch is "lost". In that case this is the log output:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 2260.0 failed 1 times, most recent failure: Lost task 47.0 in stage 2260.0 (TID 112228, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.project_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:57)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at java.util.regex.Pattern.split(Pattern.java:1206)
at java.util.regex.Pattern.split(Pattern.java:1273)
at scala.util.matching.Regex.split(Regex.scala:526)
at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:144)
at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:141)
... 22 more
I use ft_regex_tokenizer a lot, for example here to split the email address into user name and domain:
spark_tbls_separated %<>%
  # split "user@domain" on the "@" into a two-element array column
  ft_regex_tokenizer(input_col = "email",
                     output_col = "email_split",
                     pattern = "@",
                     to_lower_case = FALSE) %>%
  # expand the array into two separate columns
  sdf_separate_column(column = "email_split",
                      into = c("email_user", "email_domain")) %>%
  # drop the intermediate and original columns
  select(-email_split, -email)
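From the stack trace (the NullPointerException inside java.util.regex.Matcher), my best guess so far is that a NULL value in the email column reaches the tokenizer. A defensive filter like the sketch below would probably avoid the crash, but it would not tell me which input row actually triggered it, which is exactly why I want the task logs:

spark_tbls_separated %<>%
  # drop rows with a missing email before the tokenizer sees them (guess / workaround)
  filter(!is.na(email))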
So now I would like to know which Spark transformation actually caused the error, and for which kind of input data, so that I can debug the root cause. I guess the only way to find out is to look at the task logs (do they even exist?). Ideally, I could inspect the log of task 47 and get more detailed logging information. How can I access those logs on a local machine?
These are my configuration options, e.g. so that the history server is ready to run:
spark_config <- spark_config()
spark_config$`sparklyr.shell.driver-memory` <- "64G"
spark_config$spark.memory.fraction <- 0.75
spark_config$spark.speculation <- TRUE
spark_config$spark.speculation.multiplier <- 2
spark_config$spark.speculation.quantile <- 0.5
spark_config$sparklyr.backend.timeout <- 3600 * 2 # two-hour timeout
spark_config$spark.eventLog.enabled <- TRUE
spark_config$spark.eventLog.dir <- "file:///tmp/spark-events"
spark_config$spark.history.fs.logDirectory <- "file:///tmp/spark-events"
sc <- spark_connect(master = "local", config = spark_config)
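So far the only thing I have inspected is the Spark web UI, which I open from R with sparklyr's spark_web() helper:

# opens the driver web UI (http://localhost:4040 by default in local mode)
spark_web(sc)

That shows me the failed stage, but not the detailed per-task log output I am after.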
Note that this question is not about the actual error seen here, but about the possibility of inspecting the task logs to find out at which line my sparklyr script fails.