How can I fix the "GC overhead limit exceeded" error in PySpark 2.2.1, installed on Ubuntu 16.04.4?
In my Python 3.5.2 script, I set up Spark like this:
spark = SparkSession.builder.appName('achats_fusion_files').getOrCreate()
spark.conf.set("spark.sql.pivotMaxValues", "1000000")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.executor.memory", "1g")
spark.conf.set("spark.driver.memory", "1g")
How can I solve this with the right settings in my Python script?
The error output:
18/03/14 09:57:25 ERROR Executor: Exception in task 34.0 in stage 36.0 (TID 2076)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.compile(Pattern.java:1667)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:266)
at org.apache.spark.network.util.JavaUtils.byteStringAsBytes(JavaUtils.java:302)
at org.apache.spark.util.Utils$.byteStringAsBytes(Utils.scala:1087)
at org.apache.spark.SparkConf.getSizeAsBytes(SparkConf.scala:310)
at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:114)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:131)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:120)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: 2)
Straight from the documentation, the additional setting that helped me is:
GC tuning flags for executors can be specified by setting spark.executor.extraJavaOptions in a job's configuration.
See this for more details.
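If you want to keep everything inside the Python script instead of spark-defaults.conf, a minimal sketch (the flag values mirror the EDIT below and are illustrative, not tuned):

from pyspark.sql import SparkSession

# Sketch: executor GC flags set at session-build time; tune the values
# against your executors' GC logs rather than copying them verbatim.
spark = (SparkSession.builder
         .appName('achats_fusion_files')
         .config("spark.executor.extraJavaOptions",
                 "-XX:+UseG1GC -XX:ConcGCThreads=20 "
                 "-XX:InitiatingHeapOccupancyPercent=35")
         .getOrCreate())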
EDIT:
In your spark-defaults.conf, write (a repeated key would override the earlier one, so all executor JVM flags go into a single spark.executor.extraJavaOptions entry):
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:ConcGCThreads=20 -XX:InitiatingHeapOccupancyPercent=35
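Equivalently, if you launch with spark-submit, the same flags can be passed on the command line (the script name here is a placeholder):

spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20 -XX:InitiatingHeapOccupancyPercent=35" your_script.py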