Efficiently computing entropy in Spark

Time: 2014-07-19 07:34:32

Tags: performance scala apache-spark entropy information-theory

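Given an RDD (data) and a list of column indices over which to compute the entropy, the following process takes about 5 seconds to compute a single entropy value on a 2MB (16k rows) source.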

// Entropy (in bits) of the joint distribution of the columns listed in colIdx.
def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double = {
  println(data.toDebugString)
  data
    .map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1) // composite String key -> 1
    .reduceByKey(_ + _)                                      // occurrences per key
    .map { v =>
      val p = v._2.toDouble / count
      -p * scala.math.log(p) / scala.math.log(2)
    }
    .reduce((v1, v2) => v1 + v2)
}

The output of toDebugString is as follows:

(entropy,MappedRDD[93] at map at Q.scala:31 (8 partitions)
  UnionRDD[72] at $plus$plus at S.scala:136 (8 partitions)
    MappedRDD[60] at map at S.scala:151 (4 partitions)
      FilteredRDD[59] at filter at S.scala:150 (4 partitions)
        MappedRDD[40] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[39] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[27] at filter at S.scala:104 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions)
    MappedRDD[68] at map at S.scala:151 (4 partitions)
      FilteredRDD[67] at filter at S.scala:150 (4 partitions)
        MappedRDD[52] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[51] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[28] at filter at S.scala:105 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions),colIdex,13,count,3922)

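If I collect the RDD and parallelize it again, the computation takes about 150 ms (still high for a simple 2MB file), which will obviously become a problem when working with multiple GB of data. What am I missing in order to use Spark and Scala properly?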

My initial implementation (which performed even worse):

data.map(r => colIdx
  .map(idx => r(idx)).mkString(","))
  .groupBy(r => r)
  .map(g => g._2.size)
  .map(v => v.toDouble / count)
  .map(v => -v * scala.math.log(v) / scala.math.log(2))
  .reduce((v1, v2) => v1 + v2)

1 answer:

Answer 0: (score: 3)

First, there appears to be a bug in your code: you need to handle p being 0, so -p * math.log(p) / math.log(2) should be if (p == 0.0) 0.0 else -p * math.log(p) / math.log(2).
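A minimal sketch of the guarded term, written against plain Scala collections rather than an RDD so it can be run in isolation. The object and method names (EntropyTerm, entropyTerm, entropyBits) are hypothetical and do not come from the question.

```scala
// Hypothetical helpers for illustration; not part of the question's code.
object EntropyTerm {
  private val Ln2 = math.log(2)

  // Contribution of a single probability to the entropy, in bits.
  // Without the guard, 0.0 * log(0.0) evaluates to 0.0 * -Infinity, which is NaN.
  def entropyTerm(p: Double): Double =
    if (p == 0.0) 0.0 else -p * math.log(p) / Ln2

  // Entropy in bits of a distribution given as raw counts.
  // Assumes at least one count is non-zero.
  def entropyBits(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    counts.map(c => entropyTerm(c / total)).sum
  }
}
```

For example, EntropyTerm.entropyBits(Seq(1L, 1L)) describes a fair coin, and a zero count such as Seq(4L, 0L) no longer poisons the sum with NaN.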

Second, you could use base e; you don't really need base 2.
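Changing the base only rescales the result by a constant, so if bits are still wanted the conversion can be done once at the end instead of once per key. A small sketch of that idea, with hypothetical names (EntropyBase, entropyNats, natsToBits):

```scala
// Hypothetical helpers for illustration; not part of the question's code.
object EntropyBase {
  // Entropy in nats (natural logarithm). Zero probabilities contribute nothing.
  def entropyNats(probs: Seq[Double]): Double =
    probs.filter(_ > 0.0).map(p => -p * math.log(p)).sum

  // H_bits = H_nats / ln(2): a single division for the whole sum.
  def natsToBits(nats: Double): Double = nats / math.log(2)
}
```

If the entropy values are only compared with each other, for instance to rank columns, the conversion can be skipped entirely, because the ordering is the same in any base.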

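Anyway, the reason your code is slow is probably that you have too few partitions. You should have at least 2-4 partitions per CPU; in practice I often use more. How many CPUs do you have?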
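One way to act on the partition advice is sketched below. The helper withMorePartitions and its perCore parameter are hypothetical, and using sc.defaultParallelism as a stand-in for the number of cores is an assumption that depends on how the cluster is configured. Note that repartition triggers a shuffle, so whether it pays off on a small input would have to be measured.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration; not part of the question's code.
object Partitioning {
  // Spread `data` over roughly `perCore` partitions per core.
  // Assumption: defaultParallelism approximates the number of available cores.
  def withMorePartitions[T](data: RDD[T], perCore: Int = 3): RDD[T] = {
    val target = data.sparkContext.defaultParallelism * perCore
    // Only reshuffle when the RDD currently has fewer partitions than the target.
    if (data.partitions.length < target) data.repartition(target) else data
  }
}
```

A call such as entropy(Partitioning.withMorePartitions(data), colIdx, count) would then run the existing pipeline on the repartitioned RDD.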

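Now, what probably takes the longest is not the entropy calculation itself, since that is trivial, but the reduceByKey, which is being done on String keys. Is it possible to use some other data type? What is colIdx, and what exactly is r?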
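As one possible alternative to String keys, the selected fields can be kept as a Seq, which compares and hashes by value (a raw Array would not work as a key, since arrays use reference equality). This is only a sketch under the question's signature; the object name SeqKeyEntropy is hypothetical, and whether it is actually faster than mkString would need to be measured. It does avoid one pitfall of joining with ",": two different field combinations can collapse into the same string when a field itself contains a comma.

```scala
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, `import org.apache.spark.SparkContext._`
// is also needed to bring reduceByKey and sum into scope.

// Hypothetical variant for illustration; not the asker's or answerer's code.
object SeqKeyEntropy {
  def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double = {
    val nats = data
      .map(r => colIdx.map(r(_)).toSeq -> 1L) // Seq key: value-based equals/hashCode
      .reduceByKey(_ + _)
      .map { case (_, n) =>
        val p = n.toDouble / count
        -p * math.log(p) // n >= 1 after reduceByKey, so p > 0 here
      }
      .sum()
    nats / math.log(2) // convert to bits once
  }
}
```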

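My last observation is that you index into each record many times with colIdx.map(r.apply). Be aware that this will be very slow if r is not an Array or an IndexedSeq. If it is a List, each lookup is O(index), because the list has to be traversed to reach the index you want.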
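If a record might arrive as a List, one option is to convert it to an indexed collection once before selecting fields. The sketch below uses plain Scala collections; FieldAccess and selectFields are hypothetical names. In the question's signature r is already an Array[String], in which case no conversion is needed.

```scala
// Hypothetical helper for illustration; not part of the question's code.
object FieldAccess {
  // Select the fields at the positions in colIdx from one record.
  def selectFields(record: Seq[String], colIdx: Array[Int]): Seq[String] = {
    // At most one pass over the record to obtain an indexed view;
    // List.apply(i) would instead walk i cells on every single lookup.
    val indexed: IndexedSeq[String] = record.toIndexedSeq
    colIdx.toSeq.map(i => indexed(i))
  }
}
```

For example, FieldAccess.selectFields(List("a", "b", "c", "d"), Array(3, 0)) picks the fourth and first fields after a single conversion of the list.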