I posted another question about a similar problem a few days ago:
By now I have at least managed to get to a "working" solution, meaning the process itself seems to run correctly. But since I am a complete beginner with Spark, I seem to have missed some things about how to build this kind of application the right way (performance/computation-wise)...
What I want to do:
Load history data from ElasticSearch upon application startup
Start listening to a Kafka topic on startup (with sale events, passed as JSON strings) using Spark Streaming
My code is as follows:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
import org.elasticsearch.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
object ReadFromKafkaAndES {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("kafka").setLevel(Level.WARN)

    val checkpointDirectory = "/tmp/Spark"
    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[4]")
    conf.set("es.nodes", "localhost")
    conf.set("es.port", "9200")

    val topicsSet = Array("sales").toSet

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(15))
    ssc.checkpoint(checkpointDirectory)

    //Create SQLContext
    val sqlContext = new SQLContext(sc)

    //Get history data from ES
    var history = sqlContext.esDF("data/salesaggregation")

    //Kafka settings
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")

    // Create direct kafka stream with brokers and topics
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    //Iterate
    messages.foreachRDD { rdd =>
      //If data is present, continue
      if (rdd.count() > 0) {
        //Register temporary table for the aggregated history
        history.registerTempTable("history")
        println("--- History -------------------------------")
        history.show()

        //Parse JSON as DataFrame
        val saleEvents = sqlContext.read.json(rdd.values)

        //Register temporary table for sales events
        saleEvents.registerTempTable("sales")
        val sales = sqlContext.sql("select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId")
        println("--- Sales ---------------------------------")
        sales.show()

        val agg = sqlContext.sql("select a.userId, max(a.latestSaleTimestamp) as latestSaleTimestamp, sum(a.totalRevenue) as totalRevenue, sum(a.totalPoints) as totalPoints from ((select userId, latestSaleTimestamp, totalRevenue, totalPoints from history) union all (select userId, cast(max(saleTimestamp) as Timestamp) as latestSaleTimestamp, sum(totalRevenue) as totalRevenue, sum(totalPoints) as totalPoints from sales group by userId)) a group by userId")
        println("--- Aggregation ---------------------------")
        agg.show()

        //This is our new "history"
        history = agg

        //Cache results
        history.cache()

        //Drop temporary table
        sqlContext.dropTempTable("history")
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
The computation itself seems to work correctly:
--- History -------------------------------
+--------------------+--------------------+-----------+------------+------+
| latestSaleTimestamp| productList|totalPoints|totalRevenue|userId|
+--------------------+--------------------+-----------+------------+------+
|2015-07-22 10:03:...|Buffer(47, 1484, ...| 91| 12.05| 23|
|2015-07-22 12:50:...|Buffer(256, 384, ...| 41| 7.05| 24|
+--------------------+--------------------+-----------+------------+------+
--- Sales ---------------------------------
+------+--------------------+------------------+-----------+
|userId| latestSaleTimestamp| totalRevenue|totalPoints|
+------+--------------------+------------------+-----------+
| 23|2015-07-29 09:17:...| 255.59| 208|
| 24|2015-07-29 09:17:...|226.08999999999997| 196|
+------+--------------------+------------------+-----------+
--- Aggregation ---------------------------
+------+--------------------+------------------+-----------+
|userId| latestSaleTimestamp| totalRevenue|totalPoints|
+------+--------------------+------------------+-----------+
| 23|2015-07-29 09:17:...| 267.6400001907349| 299|
| 24|2015-07-29 09:17:...|233.14000019073484| 237|
+------+--------------------+------------------+-----------+
But when the application runs for multiple iterations, I can see the performance degrading:
I also see a large number of skipped tasks, increasing with every iteration:
The graph of the first iteration looks like this:
The graph of the second iteration looks like this:
The more iterations there are, the longer the graph gets and the more steps are skipped.
Basically, I think the problem is how the results of one iteration are stored for the next one. Unfortunately, even after trying lots of different things and reading the documentation, I haven't been able to come up with a solution. Any help is warmly appreciated. Thanks!
Answer 0 (score: 2)
This streaming job is not in a "deadlock", but its execution time grows exponentially with every iteration, which will make the streaming job fail fairly soon.
The iterative union -> reduce -> union -> reduce ... process over the RDDs creates an ever-growing RDD lineage. Every iteration adds dependencies to that lineage which have to be recomputed in the next iteration, so execution times keep increasing. The dependency (lineage) graph shows this clearly.
One solution is to checkpoint the RDDs at regular intervals:
history.checkpoint()
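As a rough sketch of how this could be applied inside foreachRDD, reusing history and sqlContext from the question's code: checkpoint the RDD behind the DataFrame and rebuild the DataFrame from it, which cuts the lineage for the following batches. The variable names are the question's own; doing this only every N batches (rather than every batch) is an assumption, since checkpointing writes to disk.

// Sketch only: cut the lineage of the running "history" DataFrame after the aggregation step.
val historyRdd = history.rdd       // the RDD[Row] behind the DataFrame
historyRdd.checkpoint()            // ssc.checkpoint(...) already set a checkpoint directory
historyRdd.count()                 // force materialization so the checkpoint is actually written
// Rebuild the DataFrame from the checkpointed RDD so later batches no longer
// depend on the whole chain of previous plans.
history = sqlContext.createDataFrame(historyRdd, history.schema)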
You could also explore using updateStateByKey for this.
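For reference, a minimal sketch of what an updateStateByKey variant might look like, assuming the Kafka JSON can be turned into (userId, aggregate) pairs. UserAgg and parseSale are hypothetical names used only to show the shape of the state update, not the actual schema from the question.

// Hypothetical running aggregate per user, kept as Spark Streaming state.
case class UserAgg(latestSaleTimestamp: Long, totalRevenue: Double, totalPoints: Long)

// Merge the current batch's sales for one user into the previously stored state.
def updateUserAgg(newSales: Seq[UserAgg], state: Option[UserAgg]): Option[UserAgg] = {
  val all = state.toSeq ++ newSales
  if (all.isEmpty) state
  else Some(UserAgg(
    all.map(_.latestSaleTimestamp).max,
    all.map(_.totalRevenue).sum,
    all.map(_.totalPoints).sum))
}

// `messages` is the direct Kafka stream from the question; parseSale is a
// hypothetical helper that turns one JSON string into a (userId, UserAgg) pair.
val perUser = messages.map { case (_, json) => parseSale(json) }

// The running totals live in Spark Streaming's checkpointed state store, so no
// ever-growing DataFrame lineage has to be carried from batch to batch.
val runningTotals = perUser.updateStateByKey(updateUserAgg _)
runningTotals.print()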