The foreachActive function of Vector

Time: 2015-01-25 17:20:28

Tags: scala apache-spark

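Can someone help me understand how the "foreachActive" function introduced for Vector is meant to be used?

I am trying to understand how it is used in the MultivariateOnlineSummarizer class to compute summary statistics.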

sample.foreachActive { (index, value) =>
  if (value != 0.0) {
    if (currMax(index) < value) {
      currMax(index) = value
    }
    if (currMin(index) > value) {
      currMin(index) = value
    }

    val prevMean = currMean(index)
    val diff = value - prevMean
    currMean(index) = prevMean + diff / (nnz(index) + 1.0)
    currM2n(index) += (value - currMean(index)) * diff
    currM2(index) += value * value
    currL1(index) += math.abs(value)

    nnz(index) += 1.0
  }
}

1 answer:

Answer 0 (score: 0)

Spark has two kinds of vectors: DenseVector and SparseVector.

For a DenseVector, all elements are active, so foreachActive effectively becomes foreach:

  private[spark] override def foreachActive(f: (Int, Double) => Unit) = {
    var i = 0
    val localValuesSize = values.size
    val localValues = values

    while (i < localValuesSize) {
      f(i, localValues(i))
      i += 1
    }
  }

A SparseVector can have inactive elements. You would either have to skip them by hand in a foreach, or you can use foreachActive, which does this under the hood:

  private[spark] override def foreachActive(f: (Int, Double) => Unit) = {
    var i = 0
    val localValuesSize = values.size
    val localIndices = indices
    val localValues = values

    while (i < localValuesSize) {
      f(localIndices(i), localValues(i))
      i += 1
    }
  }

So this is effectively a foreach function for Vectors that visits only the active elements, regardless of the Vector implementation.
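To make the difference concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names SimpleVector, SimpleDense, SimpleSparse and activeEntries are hypothetical stand-ins invented for this illustration, not Spark classes. They only mirror the two loops shown above, assuming that a sparse vector's "active" entries are the ones explicitly stored in its indices/values arrays.

```scala
// Hypothetical stand-ins for Spark's vector types, for illustration only.
sealed trait SimpleVector {
  def size: Int
  // Calls f(index, value) once for every active (stored) entry.
  def foreachActive(f: (Int, Double) => Unit): Unit
}

// Dense: every position is stored, so every position is active.
final case class SimpleDense(values: Array[Double]) extends SimpleVector {
  def size: Int = values.length
  def foreachActive(f: (Int, Double) => Unit): Unit = {
    var i = 0
    while (i < values.length) {
      f(i, values(i))
      i += 1
    }
  }
}

// Sparse: only the positions listed in `indices` are stored and active.
final case class SimpleSparse(size: Int, indices: Array[Int], values: Array[Double])
    extends SimpleVector {
  def foreachActive(f: (Int, Double) => Unit): Unit = {
    var i = 0
    while (i < values.length) {
      f(indices(i), values(i))
      i += 1
    }
  }
}

object ForeachActiveDemo {
  // Collects the (index, value) pairs that foreachActive visits, in order.
  def activeEntries(v: SimpleVector): List[(Int, Double)] = {
    val buf = scala.collection.mutable.ListBuffer.empty[(Int, Double)]
    v.foreachActive { (i, x) => buf += ((i, x)) }
    buf.toList
  }

  def main(args: Array[String]): Unit = {
    // The same logical vector (1.0, 0.0, 3.0) in both representations.
    val dense = SimpleDense(Array(1.0, 0.0, 3.0))
    val sparse = SimpleSparse(3, Array(0, 2), Array(1.0, 3.0))
    println(activeEntries(dense))  // visits all three positions, including the 0.0
    println(activeEntries(sparse)) // visits only the two stored positions
  }
}
```

In this sketch the dense vector passes its zero at index 1 to the callback, while the sparse one never does. That is presumably why the summarizer code in the question still checks value != 0.0 inside the callback: the check lets the same update logic behave identically for both representations.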