I use Spark Streaming to receive call records from a Kafka broker every 10 minutes. I want to insert these records into some temp table (global?) as soon as they arrive from Kafka.
Note that I do not want to store them in Hive. After each insert, I want to check whether the calls for a particular number exceed 20 (for example). Below is the code I have written, which converts each RDD
to a DataFrame
and then creates a temp view. However, I guess the view only contains the last RDD.
How can I insert records into the same view and run SQL on it later?
val topics = Array("AIRDRMAIN", "")
val messages = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
val Lines = messages.map(line => line.value())
val AirDRStream = Lines.map(AirDRFilter.parseAirDR)

AirDRStream.foreachRDD { rdd =>
  println("--- New RDD with " + rdd.count() + " records")
  if (rdd.count() == 0) {
    println("---WANG No logs received in this time interval=================")
  } else {
    val sqlContext = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .getOrCreate()
    import sqlContext.implicits._
    // createOrReplaceTempView replaces the view each batch, so "AIR" only
    // ever holds the records of the current RDD
    rdd.toDF().createOrReplaceTempView("AIR")
    val FilteredDR = sqlContext.sql("select refillProfileID, count(*) from AIR group by refillProfileID")
    FilteredDR.show()
  }
}

streamingContext.start()
streamingContext.awaitTermination()
Below is the updated code after adding the globalTempView logic.
val schema_string = "subscriberNumber, originNodeType, originHostName, originOperatorID, originTimeStamp, currentServiceClass, voucherBasedRefill, transactionAmount, refillProfileID, voucherGroupID, externalData1, externalData2"
// trim the field names, since schema_string is split on "," but contains spaces
val schema_rdd = StructType(schema_string.split(",")
  .map(fieldName => StructField(fieldName.trim, StringType, true)))

val init_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
println("initial count of initial RDD is " + init_df.count())
init_df.createGlobalTempView("AIRGLOBAL")

AirDRStream.foreachRDD { rdd =>
  println("--- New RDD with " + rdd.count() + " records")
  if (rdd.count() == 0) {
    println("--- No logs received in this time interval=================")
  } else {
    // NOTE: union returns a new DataFrame; init_df is immutable and is never
    // modified, so its count below does not grow
    init_df.union(rdd.toDF())
    println("after union count of initial RDD is " + init_df.count())
    rdd.toDF().createOrReplaceTempView("AIR")
    val FilteredDR = sqlContext.sql("select count(*) from AIR")
    // this INSERT will typically fail: Spark does not support inserting into
    // a view backed by an in-memory DataFrame
    val globalviewinsert = sqlContext.sql("Insert into global_temp.AIRGLOBAL select * from AIR")
    val globalview = sqlContext.sql("SELECT COUNT(*) FROM global_temp.AIRGLOBAL")
    FilteredDR.show()
    globalviewinsert.show()
    globalview.show()
  }
}

streamingContext.start()
streamingContext.awaitTermination()
Answer 0 (score: 0)
You can create a global temporary view. Quoting the documentation:
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want to have a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. Global temporary views are tied to a system-preserved database global_temp, and we must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
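To tie this back to the streaming code in the question, here is a minimal sketch (untested, and only one way to do it) that keeps a running DataFrame across batches, re-registers the global view on every batch, and then runs the over-20-calls check. It assumes the AirDRStream and schema_rdd from the question, that the case class produced by AirDRFilter.parseAirDR lines up with that schema, and a Spark 2.2+ build for createOrReplaceGlobalTempView:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val spark = SparkSession.builder().appName("CallThreshold").getOrCreate()
import spark.implicits._

// start from an empty DataFrame with the CDR schema from the question
var accumulated: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema_rdd)

AirDRStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // union returns a new DataFrame, so reassign it to keep the running state
    accumulated = accumulated.union(rdd.toDF())
    // re-register the view so the SQL below sees every batch received so far
    accumulated.createOrReplaceGlobalTempView("AIRGLOBAL")
    // flag subscribers that have exceeded 20 calls across all batches
    spark.sql(
      """SELECT subscriberNumber, COUNT(*) AS calls
        |FROM global_temp.AIRGLOBAL
        |GROUP BY subscriberNumber
        |HAVING COUNT(*) > 20""".stripMargin).show()
  }
}
Be aware that this view grows without bound and the union lineage gets deeper with every batch, so for a long-running job you would want to cache or checkpoint the accumulated DataFrame periodically, or consider mapWithState / Structured Streaming aggregations instead of an ever-growing view.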