I'm working on a pipeline that reads in a number of hive tables and parses them into some DenseVectors for eventual use in SparkML. I want to do a lot of iterations to find the optimal training parameters, both model inputs and compute resources. The dataframes I'm working with are somewhere between 50 and 100gb, spread across a dynamic number of executors on a YARN cluster.

Whenever I try to save, whether to parquet or with saveAsTable, I get a series of failed tasks, until finally it fails completely and suggests raising spark.yarn.executor.memoryOverhead. Each id is a single row, no more than a few kb.
feature_df.write.parquet('hdfs:///user/myuser/featuredf.parquet', mode='overwrite', partitionBy='id')
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 98 in stage 33.0 failed 4 times, most recent failure: Lost task 98.3 in
stage 33.0 (TID 2141, rs172.hadoop.pvt, executor 441): ExecutorLostFailure
(executor 441 exited caused by one of the running tasks) Reason: Container
killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
I currently have that set to 2g.

The Spark workers are currently getting 10gb each, and the driver (not on the cluster) gets 16gb, with a maxResultSize of 5gb.

I cache the dataframe before writing. What else can I do to troubleshoot?
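(A minimal diagnostic sketch, assuming feature_df is the cached dataframe above: tag each row with its Spark partition id and count rows per partition, to see whether a handful of partitions carry most of the data and push their executors past the memory overhead.)

from pyspark.sql.functions import spark_partition_id

# Count rows per partition; heavily skewed counts mean a few executors
# are doing most of the work.
(feature_df
 .withColumn('pid', spark_partition_id())
 .groupBy('pid')
 .count()
 .orderBy('count', ascending=False)
 .show(20))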
Edit: It seems to be trying to do all of my transformations at once. When I look at the details for the saveAsTable() method:
== Physical Plan ==
InMemoryTableScan [id#0L, label#90, features#119]
+- InMemoryRelation [id#0L, label#90, features#119], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Filter (isnotnull(id#0L) && (id#0L < 21326835))
+- InMemoryTableScan [id#0L, label#90, features#119], [isnotnull(id#0L), (id#0L < 21326835)]
+- InMemoryRelation [id#0L, label#90, features#119], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Project [id#0L, label#90, pythonUDF0#135 AS features#119]
+- BatchEvalPython [<lambda>(collect_list_is#108, 56845.0)], [id#0L, label#90, collect_list_is#108, pythonUDF0#135]
+- SortAggregate(key=[id#0L, label#90], functions=[collect_list(indexedSegs#39, 0, 0)], output=[id#0L, label#90, collect_list_is#108])
+- *Sort [id#0L ASC NULLS FIRST, label#90 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0L, label#90, 200)
+- *Project [id#0L, UDF(segment#2) AS indexedSegs#39, cast(label#1 as double) AS label#90]
+- *BroadcastHashJoin [segment#2], [entry#12], LeftOuter, BuildRight
:- HiveTableScan [id#0L, label#1, segment#2], MetastoreRelation pmccarthy, reka_data_long_all_files
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *Project [cast(entry#7 as string) AS entry#12]
+- HiveTableScan [entry#7], MetastoreRelation reka_trop50, public_crafted_audiences_sized
Answer 0 (score: 0)
My suggestion would be to disable dynamic allocation. Try running it with the configuration below:
--master yarn-client --driver-memory 15g --executor-memory 15g --executor-cores 10 --num-executors 15 --conf spark.yarn.executor.memoryOverhead=20000 --conf spark.yarn.driver.memoryOverhead=20000 --conf spark.default.parallelism=500
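(The same settings can also be applied on the session builder; a minimal sketch, assuming Spark 2.x with the pre-2.3 spark.yarn.* property names used above. Driver memory still has to be set on the spark-submit command line in client mode, since the driver JVM is already running by this point.)

from pyspark.sql import SparkSession

# Sketch only: static executor sizing with dynamic allocation switched
# off explicitly, mirroring the command-line flags above. The app name
# is hypothetical.
spark = (SparkSession.builder
         .appName('feature_pipeline')
         .config('spark.dynamicAllocation.enabled', 'false')
         .config('spark.executor.instances', '15')
         .config('spark.executor.memory', '15g')
         .config('spark.executor.cores', '10')
         .config('spark.yarn.executor.memoryOverhead', '20000')
         .config('spark.default.parallelism', '500')
         .getOrCreate())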
Answer 1 (score: 0)
Ultimately, the clue I got from the Spark user mailing list was to look at the partitions, both their balance and their sizes. As the planner had it, too much was being given to a single executor instance. Adding .repartition(1000) to the expression creating the dataframe to be written made a significant difference, and more gains could be had by creating the partitions on a well-chosen key column.
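(A minimal sketch of that fix, reusing the write from the question; the 1000 partitions and the choice of 'id' as the key column are just the figures mentioned above.)

# Repartition before writing so no single executor instance receives an
# outsized share of the data.
(feature_df
 .repartition(1000, 'id')
 .write
 .parquet('hdfs:///user/myuser/featuredf.parquet', mode='overwrite'))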