reduceByKey not working with Spark Streaming

Date: 2016-10-07 17:16:02

Tags: apache-spark apache-kafka spark-streaming

I have the following code snippet, in which reduceByKey does not seem to work.

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.HasOffsetRanges

val myKafkaMessageStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topicsSet, kafkaParams)
)

myKafkaMessageStream
  .foreachRDD { rdd => 
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val myIter = rdd.mapPartitionsWithIndex { (i, iter) =>
      val offset = offsetRanges(i)
      iter.map(item => {
        (offset.fromOffset, offset.untilOffset, offset.topic, offset.partition, item)
      })
    }

    val myRDD = myIter.filter( (<filter_condition>) ).map(row => {
      // Process row

      ((field1, field2, field3), (field4, field5))
    })

    val result = myRDD.reduceByKey((a,b) => (a._1+b._1, a._2+b._2))

    result.foreachPartition { partitionOfRecords =>
      // I don't get the reduced result here
      val connection = createNewConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      connection.close()
    }        
  }

Am I missing something?

1 Answer:

Answer 0 (score: 2)

In the streaming case, it makes more sense to express what you are looking for with reduceByKeyAndWindow, which performs the same reduce but over a specific time window.

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))

"When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks."
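As a minimal sketch of that optional numTasks argument, the same windowed reduce can be written as below (the pairs stream is the one from the snippet above; the value 8 is an arbitrary example, not something from the question):

// Same windowed reduce, but passing the optional numTasks argument
// to override the default number of reduce tasks (8 is just an example value).
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function
  Seconds(30),                // window length
  Seconds(10),                // slide interval
  8                           // numTasks
)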

http://spark.apache.org/docs/latest/streaming-programming-guide.html
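Applied to the pairs built in the question, a minimal sketch might look like the following. It assumes the ((field1, field2, field3), (field4, field5)) key/value shape from the question with numeric (here Long) values, uses a hypothetical parseRow helper in place of the omitted row-processing logic, and keeps the question's createNewConnection placeholder; the per-partition offset bookkeeping from the question is left out for brevity.

// Hypothetical sketch: build the keyed DStream with map/filter on the stream
// itself, then reduce over a 30-second window that slides every 10 seconds.
val myPairStream = myKafkaMessageStream
  .map(record => parseRow(record.value()))   // parseRow: hypothetical helper returning
                                             // ((field1, field2, field3), (field4, field5))
  .filter(pair => true /* <filter_condition> goes here */)

val reduced = myPairStream.reduceByKeyAndWindow(
  (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2),
  Seconds(30),
  Seconds(10)
)

reduced.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()   // placeholder from the question
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}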