Broadcasting a large lookup table causes a KryoSerializer error

Asked: 2017-02-27 17:54:19

Tags: java scala apache-spark

I have a large RDD whose contents total about 10GB. I want to turn it into a lookup table for use in Spark with the following command:

val lookupTable = sparkContext.broadcast(entitiesRDD.collect)

but it fails with:

17/02/27 17:33:25 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, d1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 2. To avoid this, increase spark.kryoserializer.buffer.max value.
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:299)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I cannot raise spark.kryoserializer.buffer.max to 2048 mb, because then I get this error instead:

Caused by: java.lang.IllegalArgumentException: spark.kryoserializer.buffer.max must be less than 2048 mb, got: + 2048 mb.
    at org.apache.spark.serializer.KryoSerializer.<init>(KryoSerializer.scala:66)
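For reference, this is roughly how I am setting the buffer (a minimal sketch; the app name and the 1024m value are just illustrative, and the hard limit of 2048 mb still applies):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration only: the Kryo buffer must stay below 2048 mb,
// so no value of spark.kryoserializer.buffer.max lets a ~10GB blob through.
val conf = new SparkConf()
  .setAppName("lookup-table-job") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "1024m") // must be < 2048 mb
val sparkContext = new SparkContext(conf)

The same setting can also be passed on the command line via spark-submit --conf spark.kryoserializer.buffer.max=1024m, with the same limit.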

How do other people create large lookup tables in Spark?
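One alternative I am aware of, sketched below, is to skip collect/broadcast entirely and keep the table distributed as a pair RDD, so that each lookup becomes a join (entitiesRDD, queriesRDD and the key functions are placeholders for my actual data):

// Sketch, not my actual code: key both sides and join instead of broadcasting.
// entityKey and queryKey stand in for whatever field identifies a record.
val tableByKey   = entitiesRDD.map(e => (entityKey(e), e)) // RDD[(K, Entity)]
val queriesByKey = queriesRDD.map(q => (queryKey(q), q))   // RDD[(K, Query)]

// A regular shuffle join never materializes the whole table in one place,
// so neither the driver heap nor the 2048 mb Kryo buffer limit is hit.
val resolved = queriesByKey.join(tableByKey)               // RDD[(K, (Query, Entity))]

But I would still like to know whether a broadcast lookup table of this size can be made to work at all.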

0 Answers:

No answers