I have a large RDD containing roughly 10 GB of objects. I want to turn it into a lookup table for use in Spark with:
val lookupTable = sparkContext.broadcast(entitiesRDD.collect)
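For context, a minimal version of the setup looks roughly like this (a sketch only; Entity, the input path, and objectFile are placeholders standing in for my actual data loading):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder record type; the real objects are application-specific.
case class Entity(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("lookup-table")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sparkContext = new SparkContext(conf)

// ~10 GB of objects spread across the cluster (illustrative load).
val entitiesRDD = sparkContext.objectFile[Entity]("hdfs:///path/to/entities")

// Pull everything to the driver, then ship one read-only copy to every executor.
val lookupTable = sparkContext.broadcast(entitiesRDD.collect)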
But it fails with:
17/02/27 17:33:25 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, d1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 2. To avoid this, increase spark.kryoserializer.buffer.max value.
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:299)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I can't just increase spark.kryoserializer.buffer.max to 2048 MB, because then I get this error instead:
Caused by: java.lang.IllegalArgumentException: spark.kryoserializer.buffer.max must be less than 2048 mb, got: + 2048 mb.
at org.apache.spark.serializer.KryoSerializer.<init>(KryoSerializer.scala:66)
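For reference, I was raising the buffer along these lines (illustrative sketch; per the error above, Spark requires spark.kryoserializer.buffer.max to be strictly less than 2048 MB, so 2047m is the largest value it will accept):

import org.apache.spark.SparkConf

// Kryo's buffer.max must be strictly below 2048 MB, so this is as high
// as the setting can go; "2048m" or more fails at startup with the
// IllegalArgumentException shown above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "2047m")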
How do others create large lookup tables in Spark?