I'm trying to reduce time series data into results collected per hour (to find the max, min, and average). It seems there is no way, inside the reduce block, to supply a condition that decides whether the reduction should happen (values appended to the arrays) or be skipped.
// data
// ID, VAL, DATETIME
tvFile.map((x) =>
    (x.split(',')(0), (Array(x.split(',')(1)), Array(x.split(',')(2))))) // (ID, ([VAL], [DATETIME]))
  .reduceByKey((a, b) => {
    val dt1 = DateTime.parse(a._2(0))
    val dt2 = DateTime.parse(b._2(0))
    if ((dt1.getDayOfYear == dt2.getDayOfYear) && (dt1.getHourOfDay == dt2.getHourOfDay))
      (a._1 ++ b._1, a._2 ++ b._2)
    else
      // NOT SURE WHAT TO DO HERE
  }).collect
The above is probably not the most efficient or correct approach; I'm just getting started with Spark/Scala.
Answer 0 (score: 2)
The approach should be to prepare the data so that there is a key to partition the aggregated data on. Following the code in the question, the key in this case should be (id, day-of-year, hour-of-day).
Once the data is prepared correctly, the aggregation itself is straightforward.
Example:
import org.joda.time.DateTime // Joda-Time, as used by DateTime.parse in the question

val sampleData = Seq("p1,38.1,2016-11-26T11:15:10",
  "p1,39.1,2016-11-26T11:16:10",
  "p1,35.8,2016-11-26T11:17:10",
  "p1,34.1,2016-11-26T11:18:10",
  "p2,37.2,2016-11-26T11:16:00",
  "p2,31.2,2016-11-27T11:17:00",
  "p2,31.6,2016-11-27T11:17:00",
  "p1,39.4,2016-11-26T12:15:10",
  "p2,36.3,2016-11-27T10:10:10",
  "p1,39.5,2016-11-27T12:15:00",
  "p3,36.1,2016-11-26T11:15:10")
val sampleDataRdd = sparkContext.parallelize(sampleData)
val records = sampleDataRdd.map { line =>
  val parts = line.split(",")
  val id = parts(0)
  val value = parts(1).toDouble
  val dateTime = DateTime.parse(parts(2))
  val doy = dateTime.getDayOfYear
  val hod = dateTime.getHourOfDay
  // key on (id, day-of-year, hour-of-day) so each hour of each day aggregates separately
  ((id, doy, hod), value)
}
val aggregatedRecords = records.reduceByKey(_ + _) // sums per key, as a simple example aggregation
aggregatedRecords.collect
// Array[((String, Int, Int), Double)] = Array(((p1,331,11),147.10000000000002), ((p2,332,11),62.8), ((p2,331,11),37.2), ((p1,332,12),39.5), ((p2,332,10),36.3), ((p1,331,12),39.4), ((p3,331,11),36.1))
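Since the question asks for max, min, and average per hour, the same keyed records can also be folded into those statistics, for example with aggregateByKey. A minimal sketch (the (sum, count, min, max) accumulator and the names stats/hourlyStats are illustrative, not part of the original answer):

// Accumulate (sum, count, min, max) per (id, day-of-year, hour-of-day) key.
val stats = records.aggregateByKey((0.0, 0L, Double.MaxValue, Double.MinValue))(
  (acc, v) => (acc._1 + v, acc._2 + 1, math.min(acc._3, v), math.max(acc._4, v)),
  (x, y) => (x._1 + y._1, x._2 + y._2, math.min(x._3, y._3), math.max(x._4, y._4)))

// Derive the mean from the running sum and count.
val hourlyStats = stats.mapValues { case (sum, count, mn, mx) => (mn, mx, sum / count) }
hourlyStats.collect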
Using Spark DataFrames this would also be much simpler; the question was answered with the RDD API since that is what it asks about.
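For comparison, a rough sketch of the DataFrame route (assuming Spark 2.3+ with a SparkSession named spark; the column names id/value/ts and the variable names df/hourly are made up for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// Parse the CSV lines into typed columns.
val df = sampleData.toDF("line")
  .select(split($"line", ",").as("parts"))
  .select(
    $"parts"(0).as("id"),
    $"parts"(1).cast("double").as("value"),
    $"parts"(2).cast("timestamp").as("ts"))

// Group by id and the hour the timestamp falls into, then aggregate.
val hourly = df
  .groupBy($"id", date_trunc("hour", $"ts").as("hour"))
  .agg(min("value").as("min"), max("value").as("max"), avg("value").as("avg"))

hourly.show()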