Initialise non-serializable object that reads from HDFS on the workers

Date: 2016-05-03 20:10:41

Tags: apache-spark

I need to use a non-serializable Java object inside a map operation. If I try to initialise it on the driver, I get a "Task not serializable" error. The best option would be to initialise it once per partition, but the object's initialisation reads some files from HDFS. To do that I would need sc.hadoopConfiguration, which does not seem to be available on the workers (it is null there, leading to a NullPointerException).

Can I read from HDFS on the worker side? If so, I could initialise the object for each partition using mapPartitions. If not, what is the best approach?
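For illustration, this is roughly the pattern that fails on the driver (a sketch only; the parser class and HDFS path here are made up):

  // Constructed on the driver; Spark then tries to serialize it into the
  // map closure and throws "Task not serializable"
  val parser = new MyNonSerializableParser(sc.hadoopConfiguration, "hdfs:///data/model.bin")
  rdd.map(record => parser.process(record))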

1 Answer:

Answer 0 (score: 1)

You can wrap sc.hadoopConfiguration in a SerializableWritable and use it in your code as follows:

  import org.apache.spark.SerializableWritable

  // Wrap the Hadoop Configuration so Spark can ship it to the executors
  val hadoopConf = new SerializableWritable(sc.hadoopConfiguration)
  sc.parallelize(1 to 1000, 4).mapPartitions { iter =>
    // Unwrap the Configuration on the worker side
    val conf = hadoopConf.value
    ...
  }
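
For completeness, here is a minimal sketch of how the wrapped configuration could be used to initialise the object once per partition. MyNonSerializableParser (with its load and process methods) and the modelPath argument are hypothetical stand-ins for your own object and HDFS file:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.{SerializableWritable, SparkContext}
  import org.apache.spark.rdd.RDD

  def processWithHdfsInit(sc: SparkContext, input: RDD[String], modelPath: String): RDD[String] = {
    // Wrap the driver-side Configuration so it survives closure serialization
    val hadoopConf = new SerializableWritable(sc.hadoopConfiguration)

    input.mapPartitions { iter =>
      // Runs on the executor, once per partition
      val fs = FileSystem.get(hadoopConf.value)
      val in = fs.open(new Path(modelPath))
      // Hypothetical non-serializable object built from an HDFS file
      val parser = try MyNonSerializableParser.load(in) finally in.close()
      iter.map(record => parser.process(record))
    }
  }

Because the whole partition shares one parser instance, the HDFS file is read only once per partition rather than once per record.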