Average across the elements of an RDD of rows

Asked: 2016-11-13 15:53:28

Tags: scala apache-spark rdd

I have an RDD with many rows (let's call it RDDmacReturns) that follow this structure:

case class macReturns (macAddress: String, 
                       hourReturns: Long, 
                       threeHoursReturns: Long,
                       sixHoursReturns: Long, 
                       halfDailyReturns: Long, 
                       dailyReturns: Long,
                       threeDailyReturns: Long, 
                       weeklyReturns: Long, 
                       biWeeklyReturns: Long, 
                       threeWeeklyReturns: Long, 
                       monthlyReturns: Long)

So, for example, one row of the RDD looks like:

macReturns("a2:b2:c3:d3:f4:c5", 3, 4, 1, 0, 3, 4, 3, 5, 1, 7)

The macAddresses have already been grouped, so they are all distinct.

Now, by performing transformations/actions on RDDmacReturns, I have to create a new RDD containing a single row. It should follow the same structure as above (case class macReturns) and hold a fixed, chosen (dummy) macAddress and, for each field, the average computed over the elements of RDDmacReturns, like this:

macReturns(00:00:00:00:00:00,
           averageHourReturns,
           averageThreeHoursReturns,
           averageSixHoursReturns,
           averageHalfDailyReturns,
           averageDailyReturns,
           averageThreeDailyReturns,
           averageWeeklyReturns,
           averageBiWeeklyReturns,
           averageThreeWeeklyReturns,
           averageMonthlyReturns)

To sum up, I need a function that, applied to RDDmacReturns, returns an RDDaverageReturns containing a single row (as described above).

Thanks for your help.

1 Answer:

Answer 0 (score: 1)

You can use colStats(), which returns an instance of MultivariateStatisticalSummary containing the column-wise mean. Here is a reproducible example similar to your problem:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val rdd = sc.parallelize(Seq(
  ("id1",1,2,3,4),
  ("id2",3,5,1,5),
  ("id3",3,0,9,8),
  ("id4",4,4,1,2)))
// First convert to an RDD of dense vectors, dropping the id column
val rdd_dense = rdd.map(x => Vectors.dense(x._2, x._3, x._4, x._5))
// Compute the column statistics and read off the mean vector
val summary: MultivariateStatisticalSummary = Statistics.colStats(rdd_dense)
println(summary.mean)
// [2.75,2.75,3.5000000000000004,4.75]
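To map the means back into a single macReturns row under the dummy MAC, the same field-wise sum-and-divide can be expressed directly on the rows. Below is a minimal Spark-free sketch of that aggregation logic (the case class is truncated to three fields for brevity, and `averageReturns` is a hypothetical helper name); on a real RDD the identical pattern works with rdd.map(...).reduce(...) followed by one division per field.

```scala
// Truncated version of the question's case class (remaining fields omitted)
case class macReturns(macAddress: String,
                      hourReturns: Long,
                      threeHoursReturns: Long,
                      sixHoursReturns: Long)

// Sum each field across all rows, divide by the count, and wrap the
// results in a single row under the dummy MAC address.
def averageReturns(rows: Seq[macReturns]): macReturns = {
  val n = rows.size.toLong
  macReturns("00:00:00:00:00:00",
             rows.map(_.hourReturns).sum / n,        // integer division; use Double fields for exact means
             rows.map(_.threeHoursReturns).sum / n,
             rows.map(_.sixHoursReturns).sum / n)
}

val avg = averageReturns(Seq(
  macReturns("a2:b2:c3:d3:f4:c5", 3, 4, 1),
  macReturns("b1:c2:d3:e4:f5:a6", 5, 2, 3)))
println(avg)  // macReturns(00:00:00:00:00:00,4,3,2)
```

Note that with Long fields the division truncates; switching the returns fields to Double (or rounding explicitly) preserves fractional means like the 2.75 values above.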