Question

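I am trying to write data to HBase using Spark, but I get the exception: Exception in thread "main" org.apache.spark.SparkException: Task not serializable. I am trying to open a connection on each worker node with the following code snippet: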

 val conf = HBaseConfiguration.create()
 val tableName = args(1)
 conf.set(TableInputFormat.INPUT_TABLE, tableName)
 val admin = new HBaseAdmin(conf)
 val tableDesc = new HTableDescriptor(tableName)
 val columnDesc = new HColumnDescriptor("cf".getBytes()).setBloomFilterType(BloomType.ROWCOL).setMaxVersions(5)
 tableDesc.addFamily(columnDesc)
 admin.createTable(tableDesc)

 rddData.foreachPartition( part => {
    val table = new HTable(conf, tableName)
    part.foreach( elem => {
      var put = new Put(Bytes.toBytes(elem._1))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
      table.put(put)
   })
   table.flushCommits()
 })

How can I make the task serializable when writing to HBase with Spark?

Answer 1

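If I'm not mistaken, conf (an instance of the Hadoop Configuration class) is not serializable. The closure passed to foreachPartition references it, so Spark has to serialize it in order to ship the task to the workers, and that is where it fails.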
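To illustrate the point without a cluster, here is a minimal sketch that uses only the JDK. FakeConf is a hypothetical stand-in for the Hadoop Configuration, and isSerializable only approximates the check Spark performs on a closure before shipping it; neither name comes from Spark or HBase.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side object that does not
// implement java.io.Serializable (like the Hadoop Configuration).
class FakeConf { val quorum = "localhost" }

object ClosureDemo {
  // Approximation of Spark's check: try to Java-serialize the object.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val conf = new FakeConf

    // Captures the outer conf, so conf would have to be serialized too.
    val capturing: String => String = key => conf.quorum + "/" + key

    // Builds its own FakeConf inside the function body; nothing is captured.
    val selfContained: String => String = key => {
      val local = new FakeConf
      local.quorum + "/" + key
    }

    println(s"capturing closure serializable:      ${isSerializable(capturing)}")
    println(s"self-contained closure serializable: ${isSerializable(selfContained)}")
  }
}
```

With Scala 2.12 or 2.13 I would expect the first line to report false and the second true, which is the same difference as between the code in the question and a version that creates conf inside foreachPartition.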

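Write the code so that every non-serializable object is created inside the foreachPartition block (so that it is instantiated on the worker nodes). Here is an example in which I create a second conf, and so on: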

 rddData.foreachPartition( part => {
    // Everything non-serializable is created here, on the worker, once per partition
    val conf2 = HBaseConfiguration.create()
    val tableName2 = args(1)
    conf2.set(TableInputFormat.INPUT_TABLE, tableName2)
    val table2 = new HTable(conf2, tableName2)
    part.foreach( elem => {
      val put = new Put(Bytes.toBytes(elem._1))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
      table2.put(put)
    })
    // Push any buffered puts to HBase, then release the table
    table2.flushCommits()
    table2.close()
 })
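On HBase 1.0 or later, where the HTable constructor and Put.add are deprecated, the same idea could be sketched with ConnectionFactory and Put.addColumn. This is only a sketch under assumptions: the HBaseWriter object and writeToHBase helper are hypothetical names, the RDD is assumed to hold (rowKey, value) string pairs, and the "cf" / "col" names are taken from the question.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object HBaseWriter {
  // Hypothetical helper; assumes rows are (rowKey, value) string pairs.
  def writeToHBase(rddData: RDD[(String, String)], tableName: String): Unit = {
    rddData.foreachPartition { part =>
      // Created on the executor, once per partition, so neither the
      // configuration nor the connection ever has to be serialized.
      val conf = HBaseConfiguration.create()
      val connection = ConnectionFactory.createConnection(conf)
      val table = connection.getTable(TableName.valueOf(tableName))
      try {
        part.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          table.put(put)
        }
      } finally {
        // Release the table and the connection even if a put fails.
        table.close()
        connection.close()
      }
    }
  }
}
```

Only tableName, a plain String, is captured by the closure, and a String serializes without trouble.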
