I posted another question about a similar problem a few days ago:
By now I have at least managed to get to a "working" solution, meaning the process itself seems to run correctly. But since I am a complete beginner with Spark, I seem to have missed some things about how to build this kind of application the right way (performance/computation-wise)...
What I want to do:
Load history data from ElasticSearch upon application startup
Start listening to a Kafka topic on startup (with sale events, passed as JSON strings) using Spark Streaming
My code is as follows:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
import org.elasticsearch.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
object ReadFromKafkaAndES {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("kafka").setLevel(Level.WARN)

    val checkpointDirectory = "/tmp/Spark"
    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[4]")
    conf.set("es.nodes", "localhost")
    conf.set("es.port", "9200")

    val topicsSet = Array("sales").toSet

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(15))
    ssc.checkpoint(checkpointDirectory)

    //Create SQLContext
    val sqlContext = new SQLContext(sc)

    //Get history data from ES
    var history = sqlContext.esDF("data/salesaggregation")

    //Kafka settings
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")

    // Create direct kafka stream with brokers and topics
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    //Iterate
    messages.foreachRDD { rdd =>
      //If data is present, continue
      if (rdd.count() > 0) {
        //Register temporary table for the aggregated history
        history.registerTempTable("history")
        println("--- History -------------------------------")
        history.show()

        //Parse JSON as DataFrame
        val saleEvents = sqlContext.read.json(rdd.values)

        //Register temporary table for sales events
        saleEvents.registerTempTable("sales")
        val sales = sqlContext.sql("select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId")
        println("--- Sales ---------------------------------")
        sales.show()

        val agg = sqlContext.sql("select a.userId, max(a.latestSaleTimestamp) as latestSaleTimestamp, sum(a.totalRevenue) as totalRevenue, sum(a.totalPoints) as totalPoints from ((select userId, latestSaleTimestamp, totalRevenue, totalPoints from history) union all (select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId)) a group by userId")
        println("--- Aggregation ---------------------------")
        agg.show()

        //This is our new "history"
        history = agg

        //Cache results
        history.cache()

        //Drop temporary table
        sqlContext.dropTempTable("history")
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
The computation itself seems to work correctly:
--- History -------------------------------
+--------------------+--------------------+-----------+------------+------+
| latestSaleTimestamp| productList|totalPoints|totalRevenue|userId|
+--------------------+--------------------+-----------+------------+------+
|2015-07-22 10:03:...|Buffer(47, 1484, ...| 91| 12.05| 23|
|2015-07-22 12:50:...|Buffer(256, 384, ...| 41| 7.05| 24|
+--------------------+--------------------+-----------+------------+------+
--- Sales ---------------------------------
+------+--------------------+------------------+-----------+
|userId| latestSaleTimestamp| totalRevenue|totalPoints|
+------+--------------------+------------------+-----------+
| 23|2015-07-29 09:17:...| 255.59| 208|
| 24|2015-07-29 09:17:...|226.08999999999997| 196|
+------+--------------------+------------------+-----------+
--- Aggregation ---------------------------
+------+--------------------+------------------+-----------+
|userId| latestSaleTimestamp| totalRevenue|totalPoints|
+------+--------------------+------------------+-----------+
| 23|2015-07-29 09:17:...| 267.6400001907349| 299|
| 24|2015-07-29 09:17:...|233.14000019073484| 237|
+------+--------------------+------------------+-----------+
But when the application runs for multiple iterations, I can see the performance degrading:
I also see a large number of skipped tasks, increasing with every iteration:
The graph of the first iteration looks like this:
The graph of the second iteration looks like this:
The more iterations there are, the longer the graph gets and the more steps are skipped.
Basically, I think the problem is how the results of one iteration are stored for the next one. Unfortunately, even after trying lots of different things and reading the documentation, I haven't been able to come up with a solution. Any help is warmly appreciated. Thanks!
Answer 0 (score: 2)
This streaming job is not in a "deadlock", but its execution time grows exponentially with every iteration, which will make the streaming job fail fairly soon.
The iterative union -> reduce -> union -> reduce ... process over the RDDs creates an ever-growing RDD lineage. Every iteration adds dependencies to that lineage which have to be recomputed in the next iteration, so execution times keep increasing. The dependency (lineage) graph shows this clearly.
One solution is to checkpoint the RDDs at regular intervals:
history.checkpoint()
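As a rough sketch of how this could be applied inside foreachRDD, reusing history and sqlContext from the question's code: checkpoint the RDD behind the DataFrame and rebuild the DataFrame from it, which cuts the lineage for the following batches. The variable names are the question's own; doing this only every N batches (rather than every batch) is an assumption, since checkpointing writes to disk.

// Sketch only: cut the lineage of the running "history" DataFrame after the aggregation step.
val historyRdd = history.rdd       // the RDD[Row] behind the DataFrame
historyRdd.checkpoint()            // ssc.checkpoint(...) already set a checkpoint directory
historyRdd.count()                 // force materialization so the checkpoint is actually written
// Rebuild the DataFrame from the checkpointed RDD so later batches no longer
// depend on the whole chain of previous plans.
history = sqlContext.createDataFrame(historyRdd, history.schema)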
You could also explore using updateStateByKey for this.
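For reference, a minimal sketch of what an updateStateByKey variant might look like, assuming the Kafka JSON can be turned into (userId, aggregate) pairs. UserAgg and parseSale are hypothetical names used only to show the shape of the state update, not the actual schema from the question.

// Hypothetical running aggregate per user, kept as Spark Streaming state.
case class UserAgg(latestSaleTimestamp: Long, totalRevenue: Double, totalPoints: Long)

// Merge the current batch's sales for one user into the previously stored state.
def updateUserAgg(newSales: Seq[UserAgg], state: Option[UserAgg]): Option[UserAgg] = {
  val all = state.toSeq ++ newSales
  if (all.isEmpty) state
  else Some(UserAgg(
    all.map(_.latestSaleTimestamp).max,
    all.map(_.totalRevenue).sum,
    all.map(_.totalPoints).sum))
}

// `messages` is the direct Kafka stream from the question; parseSale is a
// hypothetical helper that turns one JSON string into a (userId, UserAgg) pair.
val perUser = messages.map { case (_, json) => parseSale(json) }

// The running totals live in Spark Streaming's checkpointed state store, so no
// ever-growing DataFrame lineage has to be carried from batch to batch.
val runningTotals = perUser.updateStateByKey(updateUserAgg _)
runningTotals.print()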