My Spark job crashes at a collect() statement with the following error. Below is the error I receive:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:239)
at org.oclc.wcsync.hadoop.serverdsl.record.InputRecord.<init>(InputRecord.groovy:50)
at org.oclc.wcsync.hadoop.serverdsl.record.InputRecordConstructorAccess.newInstance(Unknown Source)
at com.twitter.chill.Instantiators$$anonfun$reflectAsm$1.apply(KryoBase.scala:141)
at com.twitter.chill.Instantiators$$anon$1.newInstance(KryoBase.scala:125)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1090)
at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:570)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:546)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.twitter.chill.Tuple1Serializer.read(TupleSerializers.scala:30)
at com.twitter.chill.Tuple1Serializer.read(TupleSerializers.scala:23)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:246)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalSorter$SpillReader.org$apache$spark$util$collection$ExternalSorter$SpillReader$$readNextItem(ExternalSorter.scala:558)
18/08/29 15:20:06 INFO storage.DiskBlockManager: Shutdown hook called
18/08/29 15:20:06 INFO util.ShutdownHookManager: Shutdown hook called
Here is the code:
JavaRDD<File> myRecords = sc.parallelize(mapper.myFunction(records.collect())).cache();
myFunction takes the list of records and iterates over them, so I call records.collect() and pass the result into myFunction. But the collect() statement brings all of the data to the driver, which causes this error. I am looking for any alternative I can use to avoid it. I know count() can be used instead of collect() in some situations, but here I actually need the list. A sketch of the kind of rewrite I have in mind follows the code below.
List<File> myFunction(List<scala.Tuple2<String, Tuple1<List<Record>>>> data) {
    List<File> list = []
    // Iterate through the list of Tuple2 instances and build up the result
    list // Groovy implicitly returns the last expression
}
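For context, here is a minimal sketch of the kind of alternative I am hoping for: run the work on the executors with mapPartitions, so that only one partition's worth of data is ever materialized as a list at a time, instead of the entire RDD on the driver. This is only an assumption on my part; it presumes that records is a JavaRDD<Tuple2<String, Tuple1<List<Record>>>>, that mapper and myFunction are serializable and available on the executors, and that the Spark 2.x FlatMapFunction signature (which returns an Iterator) applies.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple1;
import scala.Tuple2;

// Hypothetical rewrite: reuse the existing list-based myFunction one partition
// at a time instead of collect()-ing the whole RDD to the driver.
JavaRDD<File> myRecords = records.mapPartitions(iter -> {
    List<Tuple2<String, Tuple1<List<Record>>>> batch = new ArrayList<>();
    iter.forEachRemaining(batch::add);          // materialize only this partition
    return mapper.myFunction(batch).iterator(); // existing logic, applied to a smaller list
}).cache();

If the per-tuple logic inside myFunction is independent per element, a plain map() over the RDD would avoid building even the per-partition list.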
Any help is much appreciated.