How to compute the mean and variance of streaming data from Kafka over the past 25 days with Spark Streaming

Time: 2016-11-23 01:52:04

Tags: apache-spark spark-streaming

There is streaming data in Kafka, a continuous series of float values:

2016-11-23 11:00:00 | 12.2
2016-11-23 11:03:00 | 13.2
2016-11-23 11:05:00 | 15.1
...

I want to compute the mean and variance of these float values between 11:00 and 12:00 each day over the past 25 days.

Is Spark Streaming a good fit for this problem?

Thanks a lot!
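For reference, each record in this format splits on the `|` separator into a timestamp and a value. A minimal Scala parsing sketch (the `Reading` class, its field names, and the date pattern are illustrative assumptions based on the sample above):

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Illustrative record type for one "timestamp | value" line
case class Reading(ts: Timestamp, value: Double)

def parse(line: String): Reading = {
  val Array(t, v) = line.split("\\|").map(_.trim)
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  Reading(new Timestamp(fmt.parse(t).getTime), v.toDouble)
}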

1 Answer:

Answer 0 (score: 0)

@Ming, you can use this as an abstraction:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{avg, variance}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// update the batch interval according to your need

// Create a direct Kafka stream with brokers and topics;
// `brokers` and `topics` are assumed to be supplied, e.g. as program arguments
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

// Get the lines, i.e. the timestamp and the float value of each message,
// keep only the rows in the daily 11:00-12:00 window, and store them in a
// DataFrame `df`. The filter, expressed in T-SQL terms:
//
//   SELECT float_number
//   FROM   [YourTable]
//   WHERE  [YourDate] BETWEEN DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0) + '11:00'
//                         AND DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0) + '12:00'

df.select(avg($"float_number"), variance($"float_number")).show()
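The step from `messages` to the DataFrame `df` is left open above. One way to fill it, as a sketch only: parse each micro-batch, append it to a store so the 25-day history accumulates, and aggregate over that history. The Parquet path, the column names `ts` and `float_number`, and running the query inside every batch are all illustrative assumptions:

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, current_date, date_sub, hour, variance}

messages.map(_._2).foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Parse "2016-11-23 11:00:00 | 12.2" into (timestamp, value) pairs
  val parsed = rdd.map { line =>
    val Array(t, v) = line.split("\\|").map(_.trim)
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    (new Timestamp(fmt.parse(t).getTime), v.toDouble)
  }

  // Append each micro-batch to storage so the 25-day history accumulates
  // (the Parquet path is an assumption)
  parsed.toDF("ts", "float_number").write.mode("append").parquet("/data/readings")

  // Query the accumulated history: last 25 days, 11:00-12:00 window only
  sqlContext.read.parquet("/data/readings")
    .filter(hour($"ts") === 11)
    .filter($"ts".cast("date") >= date_sub(current_date(), 25))
    .select(avg($"float_number"), variance($"float_number"))
    .show()
}

ssc.start()
ssc.awaitTermination()

Re-reading the whole history on every two-second batch is shown only for brevity; in practice the 25-day aggregation would run as a separate, periodically scheduled batch query. Note that Spark also exposes `var_samp` and `var_pop` if a specific variance definition is required.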