How to make each executor read from or write to HBase?

Date: 2016-03-09 06:52:06

Tags: scala apache-spark hbase spark-streaming

I tried broadcasting the connection, but I don't know how to get around the serialization problem.

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

messages.map(_._2).filter(_.length > 0).foreachRDD(rdd => {
  val hbaseConf = HBaseConfiguration.create
  hbaseConf.set("hbase.rootdir", "hdfs://xxx.xxx.xxx.xxx:9000/hbase")
  hbaseConf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
  val connection = ConnectionFactory.createConnection(hbaseConf)

  // org.apache.hadoop.hbase.client.Connection is not Serializable,
  // so broadcasting it like this fails with a serialization error.
  val hbaseBr = ssc.sparkContext.broadcast(connection)
  rdd.foreach(x => {
    DataHandlingUtil.dataHandle(x, nameMap, dictBroadcast, platformMapBr, hbaseBr.value)
  })
})

ssc.start()
ssc.awaitTermination()

1 Answer:

Answer 0 (score: 0)

You should use the following pattern so that each executor creates its own connection:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // createNewConnection() and send() are placeholders from the Spark guide;
    // the point is that the connection is created on the executor, once per partition
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
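
Applied to HBase, the same pattern might look like the sketch below. This is only a sketch, not the poster's code: the table name "my_table", the column family "cf", the qualifier "col", and the assumption that each record is a String are all placeholders to replace with your own schema.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // This block runs on the executor, so the (non-serializable) connection
    // is created there, once per partition, and never shipped from the driver.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val table = connection.getTable(TableName.valueOf("my_table"))
    try {
      partitionOfRecords.foreach { record =>
        val put = new Put(Bytes.toBytes(record))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(record))
        table.put(put)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}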

A better version:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
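
The guide leaves ConnectionPool abstract. For HBase, one common way to get the same effect is a lazily initialized singleton object: the lazy val is created the first time a task on a given executor touches it and is then shared by every later task in that JVM. A minimal sketch under that assumption (the object name is made up):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// One instance per executor JVM; initialized lazily on first use.
object HBaseConnectionHolder {
  lazy val connection: Connection = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
    ConnectionFactory.createConnection(conf)
  }
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Reuse the executor-wide connection; do not close it here,
    // it lives for the lifetime of the executor.
    val connection = HBaseConnectionHolder.connection
    // connection.getTable(...) and write the records as in the sketch above
  }
}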

Note: the code above is copied from the Spark Streaming programming guide: https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html

Another option is to use HBaseContext, which comes with built-in bulkGet, bulkPut, and bulkDelete methods.

Here is some sample code:

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "hbase_URL")  // placeholder: your ZooKeeper quorum
hbaseConf.setInt(HConstants.ZOOKEEPER_CLIENT_PORT, 2181)
implicit val hbaseC = new HBaseContext(new SparkContext(new SparkConf()), hbaseConf)

A word on HBaseContext: the HBaseContext is the root of all Spark and HBase integration. It takes an HBase configuration and pushes it out to the Spark executors, which gives every Spark executor an HBase connection in a static location.
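
For example, bulkPut distributes the writes for you: you pass it an RDD, a table name, and a function that turns each element into a Put. The sketch below assumes the hbase-spark module's bulkPut signature and a hypothetical table "my_table" fed by an RDD of (rowKey, value) string pairs; check the exact API against the hbase-spark version you use.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// rdd: RDD[(String, String)] of (rowKey, value) pairs -- hypothetical shape
hbaseC.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("my_table"),
  record => {
    val put = new Put(Bytes.toBytes(record._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(record._2))
    put
  })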

For more details, see this link: https://hbase.apache.org/book.html