I have a custom ForeachWriter for Spark Structured Streaming that writes each row to a JDBC sink. Before the JDBC operation I also want to do a quick lookup, and after the JDBC operation I want to update a value, as in "Step 1" and "Step 3" of the sample code below.
I don't want to use an external database such as Redis or MongoDB. I want something with a low footprint, like RocksDB or Derby, embedded in the application.
I'm fine with one store file per application; much like the checkpoint directory, I would create an internal db folder.
I can't find any in-memory/embedded database for Spark.
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

def main(args: Array[String]): Unit = {
  val brokers = "quickstart:9092"
  val topic = "safe_message_landing_app_4"

  val sparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("Ganesh-Kafka-JDBC-Streaming")
    .getOrCreate()
  sparkSession.sparkContext.setLogLevel("ERROR")

  // Note: the Kafka source manages its own consumer group, so "group.id" is not
  // a supported option, and the checkpoint location belongs on the write side.
  val kafkaDataframe = sparkSession.readStream
    .format("kafka")
    .options(Map(
      "kafka.bootstrap.servers" -> brokers,
      "subscribe" -> topic,
      "startingOffsets" -> "latest"))
    .load()

  kafkaDataframe.printSchema()
  kafkaDataframe.createOrReplaceTempView("kafka_view")
  val sqlDataframe = sparkSession.sql(
    "select concat(topic, '-', partition, '-', offset) as KEY, cast(value as string) as VALUE from kafka_view")

  val customForEachWriter = new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = {
      println(s"Open started ==> partitionId ==> $partitionId ==> version ==> $version")
      true
    }

    override def process(value: Row): Unit = {
      // Step 1 ==> look up a key in a persistent KEY-VALUE store
      // Step 2 ==> JDBC operations
      // Step 3 ==> update the value in the persistent KEY-VALUE store
    }

    override def close(errorOrNull: Throwable): Unit = {
      println(" ************** Closed ****************** ")
    }
  }

  val query = sqlDataframe
    .writeStream
    .queryName("foreachquery")
    .option("checkpointLocation", "cp/kafka_reader")
    .foreach(customForEachWriter)
    .start()

  query.awaitTermination()
  sparkSession.close()
}
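For illustration, here is a minimal sketch of what Step 1 and Step 3 could look like with an embedded RocksDB store, assuming the org.rocksdb:rocksdbjni dependency; the store path and column names are made up for the example, and the JDBC call stays a placeholder:

import org.apache.spark.sql.{ForeachWriter, Row}
import org.rocksdb.{Options, RocksDB}

// Sketch: one embedded RocksDB instance per ForeachWriter,
// kept in an application-local "internal-db" folder (assumed path).
val rocksDbWriter = new ForeachWriter[Row] {
  var db: RocksDB = _

  override def open(partitionId: Long, version: Long): Boolean = {
    RocksDB.loadLibrary()
    val options = new Options().setCreateIfMissing(true)
    // one directory per partition avoids concurrent writers on the same files
    db = RocksDB.open(options, s"internal-db/partition-$partitionId")
    true
  }

  override def process(row: Row): Unit = {
    val key = row.getAs[String]("KEY")
    // Step 1: look up the key in the embedded store
    val previous = Option(db.get(key.getBytes("UTF-8"))).map(new String(_, "UTF-8"))
    // Step 2: JDBC operations would go here, possibly using `previous`
    // Step 3: update the value in the embedded store
    db.put(key.getBytes("UTF-8"), row.getAs[String]("VALUE").getBytes("UTF-8"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (db != null) db.close()
  }
}

Keying the directory by partitionId sidesteps RocksDB's single-writer lock when several partitions run in the same JVM under local[*]; on a real cluster each executor would keep its store on local disk.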
Answer 0 (score: 2)
Manjesh,
What you are looking for, "Spark and your in-memory database as one seamless cluster, sharing a single process space", with support for MVCC, is exactly what SnappyData provides. With SnappyData, the tables you want to do fast lookups on live in the same process that runs your Spark streaming job. Check it out here.
SnappyData has an Apache V2 license for the core product, and the specific use you are referring to is available in the OSS download.
(Disclosure: I am a SnappyData employee, so since the product is the answer to the question, it makes sense to give a product-specific answer.)
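To make that concrete, here is a rough sketch of the lookup/update pattern with a SnappyData row table living in the same process as the streaming job; the table and column names are illustrative, and SnappySession plus the PUT INTO upsert extension are taken from the SnappyData docs, so treat the exact API as an assumption:

import org.apache.spark.sql.{SnappySession, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("snappy-lookup").getOrCreate()
// SnappySession is SnappyData's extension of the Spark session
val snappy = new SnappySession(spark.sparkContext)

// A row table for fast key/value lookups, in the same process space as Spark
snappy.sql("CREATE TABLE IF NOT EXISTS lookup_store (k VARCHAR(100) PRIMARY KEY, v VARCHAR(1000)) USING row")

// Step 1: lookup
val previous = snappy.sql("SELECT v FROM lookup_store WHERE k = 'some-key'").collect()

// Step 3: upsert via SnappyData's PUT INTO SQL extension
snappy.sql("PUT INTO lookup_store VALUES ('some-key', 'new-value')")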