Spark: Summary Statistics

Date: 2015-01-23 14:49:06

Tags: scala apache-spark

I am trying to use Spark's summary statistics as described at https://spark.apache.org/docs/1.1.0/mllib-statistics.html.

According to the Spark documentation:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector

val observations: RDD[Vector] = ... // an RDD of Vectors

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

I have trouble building the observations: RDD[Vector] object. I try:

scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)

scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]

scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19

scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

Questions:

1) How should I cast a DenseVector to a Vector?

2) In the real program, instead of an array of doubles, I need statistics on a collection that I get from an RDD with:

def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.

So I have to do:

 myRdd.countByKey().values.map(_.toDouble)

This doesn't make much sense, because instead of working with RDDs I am now forced to work with regular Scala collections, which at some point will stop fitting in memory. All the advantages of Spark's distributed computation are lost.

How do I solve this in a scalable way?

Update

In my case I have:

val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only 
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)

How do I convert doubleCnts into observations: RDD[Vector]?

2 answers:

Answer 0 (score: 1)

1) You don't need a cast; you just need a type ascription:

val observations = sc.parallelize(Array(v: Vector))

2) Use aggregateByKey (map all keys to 1, and reduce by summing) instead of countByKey.
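The second suggestion can be sketched as follows. This is a minimal sketch, assuming a live SparkContext `sc` and a pair RDD like the question's `myRdd`; it uses `reduceByKey` as the simple special case of `aggregateByKey`, and the names `cntsByKey` and `observations` are illustrative, not from the original:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

// Count per key without collecting to the driver: map each key to 1, sum per key.
val cntsByKey = myRdd.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)

// Keep the counts distributed and wrap each one in a 1-element vector.
val observations: RDD[Vector] =
  cntsByKey.map { case (_, cnt) => Vectors.dense(cnt.toDouble) }

// colStats consumes the RDD[Vector] directly; nothing is pulled to the master
// except the final summary.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(s"mean=${summary.mean}, variance=${summary.variance}")
```

Unlike `countByKey`, which materializes the whole `Map[K, Long]` on the driver, this pipeline stays distributed end to end.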

Answer 1 (score: 0)

DenseVector has a compressed function, so you can change an RDD[DenseVector] into an RDD[Vector] like this:

val vectorObservations = observations.map(_.compressed)