Apache Spark Scala API: ReduceByKeyAndWindow in Scala

Date: 2015-12-17 19:00:30

Tags: scala apache-spark spark-streaming dstream

Since I am new to Spark's Scala API, I have run into the following problem:

In my Java code I used a reduceByKeyAndWindow transformation, but now I only see a reduceByWindow (since there is also no PairDStream in Scala). However, here are my first steps working in Scala:

import org.apache.hadoop.conf.Configuration;
import [...]

val serverIp = "xxx.xxx.xxx.xxx"
val receiverInstances = 2
val batchIntervalSec = 2
val windowSize1hSek = 60 * 60
val slideDurationSek = batchIntervalSec

val streamingCtx = new StreamingContext(sc, Seconds(batchIntervalSec))

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "xxx")

// ReceiverInputDStream
val receiver1 = streamingCtx.socketTextStream(serverIp, 7777)
val receiver2 = streamingCtx.socketTextStream(serverIp, 7778)

// DStream
val inputDStream = receiver1.union(receiver2)

// h.hh.plug.ts.val
case class DebsEntry(house: Integer, household: Integer, plug: Integer, ts: Long, value: Float)

// h.hh.plug.val
case class DebsEntryWithoutTs(house: Integer, household: Integer, plug: Integer, value: Float)

// h.hh.plug.1
case class DebsEntryWithoutTsCount(house: Integer, household: Integer, plug: Integer, count: Long)

val debsPairDStream = inputDStream.map(s => s.split(",")).map(s => DebsEntry(s(6).toInt, s(5).toInt, s(4).toInt, s(1).toLong, s(2).toFloat)) //.foreachRDD(rdd => rdd.toDF().registerTempTable("test"))

val debsPairDStreamWithoutDuplicates = debsPairDStream.transform(s => s.distinct())

val debsPairDStreamConsumptionGreater0 = debsPairDStreamWithoutDuplicates.filter(s => s.value > 100.0)

debsPairDStreamConsumptionGreater0.foreachRDD(rdd => rdd.toDF().registerTempTable("test3"))

val debsPairDStreamConsumptionGreater0withoutTs = debsPairDStreamConsumptionGreater0.map(s => DebsEntryWithoutTs(s.house, s.household, s.plug, s.value))

// 5.) Average per Plug
// 5.1) Create a count-prepared PairDStream (house, household, plug, 1)
val countPreparedPerPlug1h = debsPairDStreamConsumptionGreater0withoutTs.map(s => DebsEntryWithoutTsCount(s.house, s.household, s.plug, 1))

// 5.2) ReduceByKeyAndWindow
val countPerPlug1h = countPreparedPerPlug1h.reduceByWindow(...???...)

Everything works fine up to step 5.1. In 5.2 I now want to sum up the 1s of countPreparedPerPlug1h, but only where the other attributes (house, household, plug) are equal. The goal is to get the number of entries per (house, household, plug) combination. Can anyone help? Thanks!

Edit - first attempt

I tried the following in step 5.2:

// 5.2)
val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow((a,b) => a+b, Seconds(windowSize1hSek), Seconds(slideDurationSek))

But here I get the following error:

<console>:69: error: missing parameter type
   val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow((a,b) => a+b, Seconds(windowSize1hSek), Seconds(slideDurationSek))
                                                                     ^

It seems I am using the reduceByKeyAndWindow transformation incorrectly, but where is the mistake? The type of the values to be summed is Int, see countPreparedPerPlug1h in step 5.1 above.

2 Answers:

Answer 0 (score: 2):

Using reduceByKeyAndWindow in Scala is simpler than in the Java version. You do not have a PairDStream because the pairing is determined implicitly, so you can call the pair methods directly. The implicit resolution goes through PairDStreamFunctions.

For example:

val myPairDStream: DStream[(KeyType, ValueType)] = ...
myPairDStream.reduceByKeyAndWindow(...)

Behind the scenes, this really becomes:

new PairDStreamFunctions(myPairDStream).reduceByKeyAndWindow(...)

This PairDStreamFunctions wrapper is added to any DStream made up of Tuple2.
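
Applied to the question's data, this might look roughly as follows (a sketch only: the names debsPairDStreamConsumptionGreater0withoutTs, windowSize1hSek and slideDurationSek are taken from the question above, and keying by a tuple is an assumption about how to expose the pair methods):

// Key each entry by its (house, household, plug) tuple so the stream becomes a
// DStream of Tuple2; the implicit conversion to PairDStreamFunctions then applies.
val keyedCounts = debsPairDStreamConsumptionGreater0withoutTs
  .map(s => ((s.house, s.household, s.plug), 1L))

// Count entries per (house, household, plug) over a 1-hour window,
// sliding once per batch interval.
val countPerPlug1h = keyedCounts.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,
  Seconds(windowSize1hSek),
  Seconds(slideDurationSek))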

Answer 1 (score: 1):

I figured it out; it now seems to work with the following code:

val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow({(x, y) => x + y}, {(x, y) => x - y}, Seconds(windowSize1hSek), Seconds(slideDurationSek))

Thanks for the hint, @Justin Pihony
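
A side note that is not part of the original answer: the second function (x, y) => x - y undoes the contribution of data sliding out of the window, which makes the windowed reduction incremental, and this overload of reduceByKeyAndWindow requires checkpointing to be enabled on the StreamingContext. A minimal sketch (the directory is only a placeholder):

// Required when using reduceByKeyAndWindow with an inverse reduce function.
streamingCtx.checkpoint("hdfs:///tmp/spark-checkpoints")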