I'm running into some "lineage issues" with a Spark Streaming application that loads historical data from Elasticsearch on startup and then updates that data with data arriving in Apache Kafka messages.
A while ago I posted a question because, after running for some time, my application hit a kind of deadlock: computing the results took longer than the streaming window, which made it stall.
See
Checkpointing was recommended, and I tried it. The problem persists, though.
This is my current sample code:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
import org.elasticsearch.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object ReadFromKafkaAndES {

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("kafka").setLevel(Level.WARN)

    System.setProperty("hadoop.home.dir", "D:\\Development\\fake-hadoop-for-spark")

    val checkpointDirectory = "D:/tmp/Spark"

    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[4]")
    conf.set("es.nodes", "localhost")
    conf.set("es.port", "9200")

    val topicsSet = Array("sales").toSet
    var offsetRanges = Array[OffsetRange]()

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(20))
    ssc.checkpoint(checkpointDirectory)

    //Create SQLContext
    val sqlContext = new SQLContext(sc)

    //Get history data from ES
    var history: DataFrame = sqlContext.esDF("data/salesaggregation")

    //Kafka settings
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")

    // Create direct kafka stream with brokers and topics
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    //Iterate
    messages.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>

      //If data is present, continue
      if (rdd.count() > 0) {

        //Register temporary table for the aggregated history
        history.registerTempTable("history")
        println("--- History -------------------------------")
        history.show()

        //Parse JSON as DataFrame
        val saleEvents = sqlContext.read.json(rdd.values)

        //Register temporary table for sales events
        saleEvents.registerTempTable("sales")

        val sales = sqlContext.sql("select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId")
        println("--- Sales ---------------------------------")
        sales.show()

        val agg = sqlContext.sql("select a.userId, max(a.latestSaleTimestamp) as latestSaleTimestamp, sum(a.totalRevenue) as totalRevenue, sum(a.totalPoints) as totalPoints from ((select userId, latestSaleTimestamp, totalRevenue, totalPoints from history) union all (select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId)) a group by userId")
        println("--- Aggregation ---------------------------")
        agg.show()

        //This is our new "history"
        history = agg.toDF()

        //As recommended, call checkpoint()
        history.rdd.checkpoint()

        //Save to Elasticsearch
        history.saveToEs("data/salesaggregation", Map("es.mapping.id" -> "userId"))
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Is there a flaw in my approach? The code itself runs smoothly, but the problem remains.
I also tried using updateStateByKey, but as a beginner in Spark and Scala I find it hard to apply here: as far as I understand it only works on a DStream of key/value pairs, and I need to update several fields per userId.
Edit:
Following Holden's answer, I added history.explain(true) after the aggregation statement. This shows that the lineage grows with every iteration:
First iteration:
== Parsed Logical Plan ==
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
Project [userId#4L,latestSaleTimestamp#0,totalRevenue#3,totalPoints#2]
Subquery history
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Subquery sales
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
== Analyzed Logical Plan ==
userId: bigint, latestSaleTimestamp: timestamp, totalRevenue: double, totalPoints: bigint
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
Project [userId#4L,latestSaleTimestamp#0,totalRevenue#3,totalPoints#2]
Subquery history
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Subquery sales
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
== Optimized Logical Plan ==
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Project [userId#12L,saleTimestamp#9L,totalRevenue#11,totalPoints#10L]
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
== Physical Plan ==
Repartition 4, false
Aggregate false, [userId#4L], [userId#4L,MAX(PartialMax#34) AS latestSaleTimestamp#13,CombineSum(PartialSum#35) AS totalRevenue#14,CombineSum(PartialSum#36L) AS totalPoints#15L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS PartialMax#34,SUM(totalRevenue#28) AS PartialSum#35,SUM(totalPoints#29L) AS PartialSum#36L]
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
PhysicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate false, [userId#12L], [userId#12L,CAST(MAX(PartialMax#40L), TimestampType) AS latestSaleTimestamp#25,CombineSum(PartialSum#41) AS totalRevenue#26,CombineSum(PartialSum#42L) AS totalPoints#27L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#12L], [userId#12L,MAX(saleTimestamp#9L) AS PartialMax#40L,SUM(totalRevenue#11) AS PartialSum#41,SUM(totalPoints#10L) AS PartialSum#42L]
PhysicalRDD [userId#12L,saleTimestamp#9L,totalRevenue#11,totalPoints#10L], MapPartitionsRDD[6] at foreachRDD at ReadFromKafkaAndES.scala:51
Code Generation: false
== RDD ==
Second iteration:
== Parsed Logical Plan ==
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#13) AS latestSaleTimestamp#147,SUM(totalRevenue#14) AS totalRevenue#148,SUM(totalPoints#15L) AS totalPoints#149L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#13,totalRevenue#14,totalPoints#15L]
Subquery history
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
Project [userId#4L,latestSaleTimestamp#0,totalRevenue#3,totalPoints#2]
Subquery history
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Subquery sales
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
Aggregate [userId#146L], [userId#146L,CAST(MAX(saleTimestamp#143L), TimestampType) AS latestSaleTimestamp#159,SUM(totalRevenue#145) AS totalRevenue#160,SUM(totalPoints#144L) AS totalPoints#161L]
Subquery sales
Relation[lineItems#139,otherRevenue#140,productList#141,productRevenue#142,saleTimestamp#143L,totalPoints#144L,totalRevenue#145,userId#146L] org.apache.spark.sql.json.JSONRelation@eecc133
== Analyzed Logical Plan ==
userId: bigint, latestSaleTimestamp: timestamp, totalRevenue: double, totalPoints: bigint
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#13) AS latestSaleTimestamp#147,SUM(totalRevenue#14) AS totalRevenue#148,SUM(totalPoints#15L) AS totalPoints#149L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#13,totalRevenue#14,totalPoints#15L]
Subquery history
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Subquery a
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
Project [userId#4L,latestSaleTimestamp#0,totalRevenue#3,totalPoints#2]
Subquery history
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Subquery sales
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
Aggregate [userId#146L], [userId#146L,CAST(MAX(saleTimestamp#143L), TimestampType) AS latestSaleTimestamp#159,SUM(totalRevenue#145) AS totalRevenue#160,SUM(totalPoints#144L) AS totalPoints#161L]
Subquery sales
Relation[lineItems#139,otherRevenue#140,productList#141,productRevenue#142,saleTimestamp#143L,totalPoints#144L,totalRevenue#145,userId#146L] org.apache.spark.sql.json.JSONRelation@eecc133
== Optimized Logical Plan ==
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#13) AS latestSaleTimestamp#147,SUM(totalRevenue#14) AS totalRevenue#148,SUM(totalPoints#15L) AS totalPoints#149L]
Union
Repartition 4, false
Aggregate [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS latestSaleTimestamp#13,SUM(totalRevenue#28) AS totalRevenue#14,SUM(totalPoints#29L) AS totalPoints#15L]
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
LogicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate [userId#12L], [userId#12L,CAST(MAX(saleTimestamp#9L), TimestampType) AS latestSaleTimestamp#25,SUM(totalRevenue#11) AS totalRevenue#26,SUM(totalPoints#10L) AS totalPoints#27L]
Project [userId#12L,saleTimestamp#9L,totalRevenue#11,totalPoints#10L]
Relation[lineItems#5,otherRevenue#6,productList#7,productRevenue#8,saleTimestamp#9L,totalPoints#10L,totalRevenue#11,userId#12L] org.apache.spark.sql.json.JSONRelation@eecc133
Aggregate [userId#146L], [userId#146L,CAST(MAX(saleTimestamp#143L), TimestampType) AS latestSaleTimestamp#159,SUM(totalRevenue#145) AS totalRevenue#160,SUM(totalPoints#144L) AS totalPoints#161L]
Project [userId#146L,saleTimestamp#143L,totalRevenue#145,totalPoints#144L]
Relation[lineItems#139,otherRevenue#140,productList#141,productRevenue#142,saleTimestamp#143L,totalPoints#144L,totalRevenue#145,userId#146L] org.apache.spark.sql.json.JSONRelation@eecc133
== Physical Plan ==
Repartition 4, false
Aggregate false, [userId#4L], [userId#4L,MAX(PartialMax#166) AS latestSaleTimestamp#147,CombineSum(PartialSum#167) AS totalRevenue#148,CombineSum(PartialSum#168L) AS totalPoints#149L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#4L], [userId#4L,MAX(latestSaleTimestamp#13) AS PartialMax#166,SUM(totalRevenue#14) AS PartialSum#167,SUM(totalPoints#15L) AS PartialSum#168L]
Union
Repartition 4, false
Aggregate false, [userId#4L], [userId#4L,MAX(PartialMax#172) AS latestSaleTimestamp#13,CombineSum(PartialSum#173) AS totalRevenue#14,CombineSum(PartialSum#174L) AS totalPoints#15L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#4L], [userId#4L,MAX(latestSaleTimestamp#0) AS PartialMax#172,SUM(totalRevenue#28) AS PartialSum#173,SUM(totalPoints#29L) AS PartialSum#174L]
Union
Project [userId#4L,latestSaleTimestamp#0,CAST(totalRevenue#3, DoubleType) AS totalRevenue#28,CAST(totalPoints#2, LongType) AS totalPoints#29L]
PhysicalRDD [latestSaleTimestamp#0,productList#1,totalPoints#2,totalRevenue#3,userId#4L], MapPartitionsRDD[1] at createDataFrame at EsSparkSQL.scala:28
Aggregate false, [userId#12L], [userId#12L,CAST(MAX(PartialMax#178L), TimestampType) AS latestSaleTimestamp#25,CombineSum(PartialSum#179) AS totalRevenue#26,CombineSum(PartialSum#180L) AS totalPoints#27L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#12L], [userId#12L,MAX(saleTimestamp#9L) AS PartialMax#178L,SUM(totalRevenue#11) AS PartialSum#179,SUM(totalPoints#10L) AS PartialSum#180L]
PhysicalRDD [userId#12L,saleTimestamp#9L,totalRevenue#11,totalPoints#10L], MapPartitionsRDD[45] at foreachRDD at ReadFromKafkaAndES.scala:51
Aggregate false, [userId#146L], [userId#146L,CAST(MAX(PartialMax#196L), TimestampType) AS latestSaleTimestamp#159,CombineSum(PartialSum#197) AS totalRevenue#160,CombineSum(PartialSum#198L) AS totalPoints#161L]
Exchange (HashPartitioning 200)
Aggregate true, [userId#146L], [userId#146L,MAX(saleTimestamp#143L) AS PartialMax#196L,SUM(totalRevenue#145) AS PartialSum#197,SUM(totalPoints#144L) AS PartialSum#198L]
PhysicalRDD [userId#146L,saleTimestamp#143L,totalRevenue#145,totalPoints#144L], MapPartitionsRDD[46] at foreachRDD at ReadFromKafkaAndES.scala:51
Code Generation: false
== RDD ==
So it seems that the checkpointing does not have the desired effect, but I still can't figure out how to solve this. Thanks for any hints!
Answer 0 (score: 1)
I would suggest looking into updateStateByKey some more, and consider flat-mapping the updates (so that each input record can create multiple possible updates).
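A rough sketch of the shape this could take (the SaleUpdate case class, the json4s parsing, and the merge function are my assumptions based on the field names in the question's queries, not tested code):

// Sketch only: keep the per-user aggregate as streaming state instead of re-aggregating
// an ever-growing DataFrame. Field names follow the question's schema; everything else
// is assumed for illustration.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class SaleUpdate(latestSaleTimestamp: Long, totalRevenue: Double, totalPoints: Long)

// One Kafka message yields zero or more (userId, update) pairs - the "flat map the updates" part.
def parseSaleUpdates(json: String): Seq[(Long, SaleUpdate)] = {
  implicit val formats = DefaultFormats
  val j = parse(json)
  Seq((
    (j \ "userId").extract[Long],
    SaleUpdate(
      (j \ "saleTimestamp").extract[Long],
      (j \ "totalRevenue").extract[Double],
      (j \ "totalPoints").extract[Long])))
}

// Merge the updates that arrived in this batch into the existing per-user state.
def updateUser(updates: Seq[SaleUpdate], state: Option[SaleUpdate]): Option[SaleUpdate] = {
  val all = state.toSeq ++ updates
  if (all.isEmpty) state
  else Some(SaleUpdate(
    all.map(_.latestSaleTimestamp).max,
    all.map(_.totalRevenue).sum,
    all.map(_.totalPoints).sum))
}

// messages is the direct Kafka stream of (key, jsonValue) from the question; the resulting
// DStream[(Long, SaleUpdate)] holds running totals per userId and relies on the
// ssc.checkpoint(...) call that is already in place.
val perUserTotals = messages
  .flatMap { case (_, json) => parseSaleUpdates(json) }
  .updateStateByKey(updateUser _)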
One interesting thing is that you checkpoint the rdd of the history DataFrame, but when you register it as a temp table it always uses the logical plan, so your checkpointing may not be doing anything (but I'm not sure, and it's close to 1am here, so I'd add an explain to your debugging and then you can see what the lineage ends up looking like).
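Purely as an illustration of what actually cutting the lineage could look like (this is an assumption on my part, not something verified against this job): materialize the checkpointed RDD and rebuild the DataFrame on top of it, so the temp table registered in the next batch starts from a plain LogicalRDD instead of dragging the whole union/aggregate tree along.

// Sketch only: force the checkpoint to be written, then rebuild the DataFrame from the
// checkpointed RDD so the next iteration's logical plan no longer references the old one.
val aggRdd = agg.rdd            // RDD[Row] backing the aggregated DataFrame
aggRdd.checkpoint()             // uses the checkpoint directory already set on the context
aggRdd.count()                  // an action is needed, otherwise the checkpoint never materializes
history = sqlContext.createDataFrame(aggRdd, agg.schema)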