Bulk insert into HBase: ConsumerRecord is not serializable

Date: 2017-08-30 16:54:22

Tags: scala apache-spark apache-kafka hbase rdd

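I have a Kafka client that polls a topic for records and stores them as consumerRecords: ConsumerRecords[String, String]. I want to iterate over each record and write (offset, value) as (k, v) to an HBase table. I am trying to parallelize the records with Spark so that I can map them to an RDD for bulk insertion into HBase.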

val hbaseTable: String = "/app/raphattack/TEST"
val conf: Configuration = HBaseConfiguration.create()
val admin: Admin = ConnectionFactory.createConnection(conf).getAdmin
val connection: Connection = ConnectionFactory.createConnection(admin.getConfiguration)
val table: Table = connection.getTable(TableName.valueOf(hbaseTable))

val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat2.configureIncrementalLoadMap(job, table)

val spark: SparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
val records: RDD[ConsumerRecord[String, String]] = spark.sparkContext.parallelize(consumerRecords.toSeq)

val rdd: RDD[(ImmutableBytesWritable, KeyValue)] = records.map(record => {
  val kv: KeyValue = new KeyValue(Bytes.toBytes(record.offset()), "cf".getBytes(), "c1".getBytes(), s"${record.value}".getBytes())
  (new ImmutableBytesWritable(Bytes.toBytes(record.offset())), kv)
})

rdd.saveAsNewAPIHadoopFile("/tmp/test", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

I get this exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. 
Exception during serialization: java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
        - object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = test, partition = 0, offset = 14691347, timestamp = 0, producer = null, key = 1, value = {"id":1.0,"name":"test"}))

Is it possible to make the ConsumerRecord object serializable? If not, how can I iterate over the records without sacrificing write speed to HBase?

1 Answer:

Answer 0 (score: 0)

I am trying to do the same thing in a unit test.

Basically, you need to set a serializer on the SparkConf:

.set("spark.serializer", classOf[KryoSerializer].getName)
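
For context, here is a minimal sketch of where that setting could go, assuming the SparkSession is built the same way as in the question. The name sparkConf is only illustrative, and this has not been tested against the question's cluster setup.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession

// Assumption: the serializer is configured before the session is created,
// because it is read when the SparkContext starts.
val sparkConf: SparkConf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)

// Same builder chain as in the question, with the SparkConf passed in.
val spark: SparkSession = SparkSession.builder
  .config(sparkConf)
  .enableHiveSupport
  .getOrCreate
```

Another option, not part of the answer above, would be to map each record to an (offset, value) tuple on the driver before calling parallelize, since a tuple of Long and String is serializable with the default Java serializer.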