I am running a Spark 2.1.1 job on an Azure VM (local mode) with 16 cores and 55 GB of RAM.
I initialize the shell with this command:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
and run the following script on the data:
import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RemoveHTML, RecordLoader, WriteGEXF}
import io.archivesunleashed.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/data2/toronto-mayor/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data2/toronto-mayor-data/all-domains")
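(Not part of the original question, but useful context for what follows.) Gzip-compressed archive files are not splittable, so each .gz file typically maps to one partition, and each partition is processed by a single task; under local[*] up to 16 of those tasks decode records concurrently inside the one 45 GB JVM. A quick sketch for checking the task count in the shell, assuming the same loader call as above:

// Sketch (not from the original post): inspect how many tasks the input will produce.
// Each non-splittable .gz file generally maps to one partition, i.e. one task.
val records = RecordLoader.loadArchives("/data2/toronto-mayor/*.gz", sc)
println(s"input partitions (tasks): ${records.getNumPartitions}")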
The data is relatively small (290 GB total), but it consists of 292 files ranging in size from 38 KB to 7 GB, averaging about 1 GB each. The machine has 100 GB of swap available, and while monitoring with htop during execution I see no memory spike above 45 GB and no swap usage. Everything appears to run fine, and then it crashes with the following error:
ERROR Executor - Exception in task 13.0 in stage 0.0 (TID 13)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.StringCoding.safeTrim(StringCoding.java:89)
at java.lang.StringCoding.access$100(StringCoding.java:50)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:546)
at java.lang.String.<init>(String.java:566)
at io.archivesunleashed.data.WarcRecordUtils.getWarcResponseMimeType(WarcRecordUtils.java:102)
at io.archivesunleashed.spark.archive.io.ArchiveRecord.<init>(ArchiveRecord.scala:74)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at io.archivesunleashed.spark.matchbox.RecordLoader$$anonfun$2.apply(RecordLoader.scala:37)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Many of the other discussions on this site involve cluster mode or setting --driver-memory. Any help is appreciated. For reference, these are the launch variations I have tried so far:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.4 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.fraction=0.8 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=64 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.default.parallelism=500 --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=100G --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 10G --packages "io.archivesunleashed:aut:0.12.1"
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --driver-memory 45G --executor-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
Answer 0 (score: 0)
The solution ended up being to reduce the number of worker threads.
By default, Spark runs as local[*], which uses as many threads as the machine has cores, in this case 16.
Reducing this to local[12] allowed the job to complete.
The syntax to run it:
./spark-2.1.1-bin-hadoop2.6/bin/spark-shell --master local[12] --driver-memory 45G --packages "io.archivesunleashed:aut:0.12.1"
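Presumably this works because, in local mode, all tasks share the heap of the single driver/executor JVM, so fewer concurrent tasks leaves more memory for each one during the shuffle write shown in the stack trace. If the job is later packaged as a standalone application instead of being run in spark-shell, the same thread cap can be expressed through SparkConf; a minimal sketch under that assumption (the application name is hypothetical, and driver memory still has to be supplied at launch, e.g. via spark-submit --driver-memory 45G, since the JVM is already running by the time this code executes):

// Sketch (assumption, not from the answer above): equivalent configuration
// for a compiled application rather than the interactive shell.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("toronto-mayor-domains")  // hypothetical application name
  .setMaster("local[12]")               // cap concurrent tasks at 12 of the 16 cores
val sc = new SparkContext(conf)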