I am trying to broadcast the connection, but I don't know how to solve the serialization problem.
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
messages.map(_._2).filter(_.length > 0).foreachRDD(rdd => {
  val hbaseConf = HBaseConfiguration.create
  hbaseConf.set("hbase.rootdir", "hdfs://xxx.xxx.xxx.xxx:9000/hbase")
  hbaseConf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
  val connection = ConnectionFactory.createConnection(hbaseConf)
  // Broadcasting the Connection fails because it is not serializable
  val hbaseBr = ssc.sparkContext.broadcast(connection)
  rdd.foreach(x => {
    DataHandlingUtil.dataHandle(x, nameMap, dictBroadcast, platformMapBr, hbaseBr.value)
  })
})
ssc.start()
ssc.awaitTermination()
Answer 0 (score: 0):
You should use the following pattern, so that the connection is created on each executor:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
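Applied to the HBase code in the question, a minimal sketch of this per-partition pattern might look like the following. The table name "my_table", the column family "cf", and the way a record is turned into a Put are assumptions for illustration only; in the asker's program the call to DataHandlingUtil.dataHandle would go where the Put is built.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

messages.map(_._2).filter(_.length > 0).foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Create the (non-serializable) connection on the executor, once per partition
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val table = connection.getTable(TableName.valueOf("my_table")) // hypothetical table
    try {
      partitionOfRecords.foreach { record =>
        // Hypothetical mapping: record as both row key and cf:value cell
        val put = new Put(Bytes.toBytes(record))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(record))
        table.put(put)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}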
An even better version:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
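ConnectionPool is not something Spark provides; it stands for whatever pooling you implement yourself. For HBase, a minimal sketch could be a lazily initialized singleton per executor JVM rather than a real pool (the ZooKeeper quorum is taken from the question; the object name and everything else are assumptions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Hypothetical helper: one HBase Connection per executor JVM, created lazily on first use.
// HBase's Connection is thread-safe and heavyweight, so a single shared instance
// usually plays the role of the "pool" here.
object HBaseConnectionPool {
  lazy val connection: Connection = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "Master,slave1,slave2")
    ConnectionFactory.createConnection(conf)
  }
  def getConnection(): Connection = connection
}

Inside foreachPartition you would then call HBaseConnectionPool.getConnection() and skip close(), since the connection is meant to be reused across batches.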
Note: the code above is copied from the Spark Streaming programming guide: https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
Another option is to use HBaseContext, which comes with built-in bulkGet, bulkPut, and bulkDelete methods.
Here is sample code:
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.{SparkConf, SparkContext}

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "hbase_URL")
hbaseConf.setInt(HConstants.ZOOKEEPER_CLIENT_PORT, 2181)
implicit val hbaseC = new HBaseContext(new SparkContext(new SparkConf()), hbaseConf)
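A bulkPut call on top of that context could then look roughly like the sketch below. It assumes the hbase-spark module's signature, where bulkPut takes an RDD, a TableName, and a function that turns each element into a Put; the table name "my_table", the column family "cf", the sample RDD, and the name sc for the SparkContext passed to the HBaseContext are all assumptions for illustration.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical RDD of (rowKey, value) pairs; sc is the SparkContext used above
val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

hbaseC.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("my_table"), // hypothetical table
  record => {
    // Row key from the first element, a single cf:value column from the second
    val put = new Put(Bytes.toBytes(record._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(record._2))
    put
  })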
A word about HBaseContext: HBaseContext is the root of all Spark and HBase integration. It takes an HBase configuration and pushes it out to the Spark executors, which lets us have one HBase connection per Spark executor in a static location.
For more details, see https://hbase.apache.org/book.html