Running into a "deadlock" while doing streaming aggregations from Kafka

Time: 2015-07-29 07:45:57

Tags: scala apache-spark spark-streaming apache-spark-sql

I posted another question about a similar problem a few days ago:

I managed to get at least a "working" solution now, meaning the process itself seems to work correctly. But, since I am a bloody beginner with Spark, I seem to have missed some things about how to build these kinds of applications the right way (performance-/computation-wise)...

What I want to do:

  1. Load the historical data from ElasticSearch on application startup

  2. Start listening to a Kafka topic with Spark Streaming on startup (consuming sales events, which arrive as JSON strings)

  3. For each incoming RDD, do an aggregation per user

  4. Union the results with the history

  5. Aggregate the new values, e.g. the total revenue per user

  6. Use the results of step 5 as the new "history" for the next iteration

My code is the following:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.kafka._
    import org.apache.spark.{SparkContext, SparkConf}
    import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
    import org.elasticsearch.spark.sql._
    import org.apache.log4j.Logger
    import org.apache.log4j.Level
    
    object ReadFromKafkaAndES {
      def main(args: Array[String]) {
    
        Logger.getLogger("org").setLevel(Level.WARN)
        Logger.getLogger("akka").setLevel(Level.WARN)
        Logger.getLogger("kafka").setLevel(Level.WARN)
    
        val checkpointDirectory = "/tmp/Spark"
        val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[4]")
        conf.set("es.nodes", "localhost")
        conf.set("es.port", "9200")
    
        val topicsSet = Array("sales").toSet
    
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(15))
        ssc.checkpoint(checkpointDirectory)
    
    //Create SQLContext
        val sqlContext = new SQLContext(sc)
    
        //Get history data from ES
        var history = sqlContext.esDF("data/salesaggregation")
    
        //Kafka settings
        val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    
        // Create direct kafka stream with brokers and topics
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicsSet)
    
        //Iterate
        messages.foreachRDD { rdd =>
    
          //If data is present, continue
          if (rdd.count() > 0) {
    
            //Register temporary table for the aggregated history
            history.registerTempTable("history")
    
            println("--- History -------------------------------")
            history.show()
    
            //Parse JSON as DataFrame
            val saleEvents = sqlContext.read.json(rdd.values)
    
            //Register temporary table for sales events
            saleEvents.registerTempTable("sales")
    
            val sales = sqlContext.sql("select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId")
    
            println("--- Sales ---------------------------------")
            sales.show()
    
            val agg = sqlContext.sql("select a.userId, max(a.latestSaleTimestamp) as latestSaleTimestamp, sum(a.totalRevenue) as totalRevenue, sum(a.totalPoints) as totalPoints from ((select userId, latestSaleTimestamp, totalRevenue, totalPoints from history) union all (select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId)) a group by userId")
    
            println("--- Aggregation ---------------------------")
            agg.show()
    
            //This is our new "history"
            history = agg
    
            //Cache results
            history.cache()
    
            //Drop temporary table
            sqlContext.dropTempTable("history")
    
          }
    
        }
    
        // Start the computation
        ssc.start()
        ssc.awaitTermination()
      }
    }
    

The computations themselves seem to work correctly:

    --- History -------------------------------
    +--------------------+--------------------+-----------+------------+------+
    | latestSaleTimestamp|         productList|totalPoints|totalRevenue|userId|
    +--------------------+--------------------+-----------+------------+------+
    |2015-07-22 10:03:...|Buffer(47, 1484, ...|         91|       12.05|    23|
    |2015-07-22 12:50:...|Buffer(256, 384, ...|         41|        7.05|    24|
    +--------------------+--------------------+-----------+------------+------+
    
    --- Sales ---------------------------------
    +------+--------------------+------------------+-----------+
    |userId| latestSaleTimestamp|      totalRevenue|totalPoints|
    +------+--------------------+------------------+-----------+
    |    23|2015-07-29 09:17:...|            255.59|        208|
    |    24|2015-07-29 09:17:...|226.08999999999997|        196|
    +------+--------------------+------------------+-----------+
    
    --- Aggregation ---------------------------
    +------+--------------------+------------------+-----------+
    |userId| latestSaleTimestamp|      totalRevenue|totalPoints|
    +------+--------------------+------------------+-----------+
    |    23|2015-07-29 09:17:...| 267.6400001907349|        299|
    |    24|2015-07-29 09:17:...|233.14000019073484|        237|
    +------+--------------------+------------------+-----------+
    

But when the application runs for more iterations, I can see that the performance deteriorates:

[Screenshot: streaming graphs]

I also see a large number of skipped tasks, growing with every iteration:

[Screenshot: skipped tasks]

The graph for the first iteration looks like this:

[Screenshot: graph of the first iteration]

The graph for the second iteration looks like this:

[Screenshot: graph of the second iteration]

The more iterations run, the longer the graph gets and the more steps are skipped.

Basically, I think the problem lies in how the results of one iteration are stored for the next iteration. Unfortunately, even after trying a lot of different things and reading the docs, I haven't been able to come up with a solution for this. Any help is warmly appreciated. Thanks!

1 answer:

Answer 0 (score: 2):

This streaming job is not in a "deadlock", but its execution time increases exponentially with every iteration, which will cause the streaming job to fail fairly quickly.

The iterative union -> reduce -> union -> reduce ... process on the RDDs creates an ever-increasing RDD lineage. Every iteration adds dependencies to that lineage which have to be computed in the next iteration, resulting in growing execution times. The dependency (lineage) graph shows this clearly.
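
To make the growth concrete, here is a tiny standalone illustration with plain RDDs rather than the question's DataFrames (it only assumes an existing SparkContext `sc`):

    // Every iteration stacks another union on top of the previous result,
    // so the dependency chain that has to be scheduled grows with each pass.
    var running = sc.parallelize(Seq(1, 2, 3))
    for (batch <- 1 to 50) {
      running = running.union(sc.parallelize(Seq(batch)))
    }
    // toDebugString shows one UnionRDD level per iteration; any action on
    // `running` has to walk this entire, ever-growing lineage.
    println(running.toDebugString)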

One solution is to checkpoint the RDD at regular intervals:

    history.checkpoint()
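
As a sketch of how this could be wired into the question's existing `foreachRDD` block: since `history` is a DataFrame and `checkpoint()` in this Spark version lives on the RDD API, one way is to checkpoint the underlying RDD every few batches and rebuild the DataFrame from it. The 10-batch interval and the `batchCount` counter below are arbitrary illustration choices, not something prescribed by the answer:

    // Sketch only: to be placed right after `history = agg` inside the
    // question's foreachRDD block. `batchCount` is a new driver-side counter
    // declared outside the loop, e.g. `var batchCount = 0L`.
    batchCount += 1
    if (batchCount % 10 == 0) {
      val historyRdd = history.rdd    // underlying RDD[Row]
      historyRdd.checkpoint()         // uses the directory set via ssc.checkpoint(...)
      historyRdd.count()              // action that forces the checkpoint files to be written
      // Rebuild the DataFrame on top of the checkpointed RDD so its plan no
      // longer references the whole chain of previous unions.
      history = sqlContext.createDataFrame(historyRdd, history.schema)
    }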

You could also explore replacing the union/reduce process with updateStateByKey.
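
For illustration, a rough sketch of what the updateStateByKey variant could look like, building on the `messages` DStream from the question. The field names follow the question's schema, but the JSON layout (e.g. the timestamp arriving as epoch milliseconds) and the json4s-based parsing are assumptions; the case classes would live at top level next to `ReadFromKafkaAndES`:

    import org.apache.spark.streaming.dstream.DStream
    import org.json4s.DefaultFormats
    import org.json4s.jackson.JsonMethods.parse

    // Simplified shapes for the incoming events and the per-user state.
    case class SaleEvent(userId: Int, saleTimestamp: Long, totalRevenue: Double, totalPoints: Long)
    case class UserTotals(latestSaleTimestamp: Long, totalRevenue: Double, totalPoints: Long)

    // Turn each Kafka (key, jsonValue) pair into (userId, totals-for-this-sale).
    val perUserSales: DStream[(Int, UserTotals)] = messages.map { case (_, json) =>
      implicit val formats = DefaultFormats
      val ev = parse(json).extract[SaleEvent]
      (ev.userId, UserTotals(ev.saleTimestamp, ev.totalRevenue, ev.totalPoints))
    }

    // Merge the current batch's sales into the running totals kept per user.
    def updateTotals(newSales: Seq[UserTotals], state: Option[UserTotals]): Option[UserTotals] = {
      val all = state.toSeq ++ newSales
      if (all.isEmpty) state
      else Some(UserTotals(
        all.map(_.latestSaleTimestamp).max,
        all.map(_.totalRevenue).sum,
        all.map(_.totalPoints).sum))
    }

    // Spark keeps and checkpoints this state between batches (this requires
    // ssc.checkpoint(...), which the question's code already calls).
    val runningTotals: DStream[(Int, UserTotals)] = perUserSales.updateStateByKey(updateTotals)
    runningTotals.print()

There is also an overload of updateStateByKey that accepts an initial state RDD, which might be a way to seed the state with the history loaded from ElasticSearch instead of unioning it in on every batch.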