Understanding the Spark correlation algorithm

Date: 2017-07-03 15:57:42

Tags: algorithm scala apache-spark cross-correlation pearson-correlation

I am reading the source code of Spark's correlation algorithm, and while browsing it I could not understand this particular piece of code.

It comes from the file org/apache/spark/mllib/linalg/BLAS.scala:

  /** U += alpha * v * v^T: a symmetric packed rank-1 update (BLAS "SPR").
   *  U holds the upper triangle of an n x n symmetric matrix in column-major
   *  packed storage, so element (i, j) with i <= j lives at j*(j+1)/2 + i. */
  def spr(alpha: Double, v: Vector, U: Array[Double]): Unit = {
    val n = v.size
    v match {
      case DenseVector(values) =>
        // Dense case: delegate to the native BLAS DSPR routine.
        NativeBLAS.dspr("U", n, alpha, values, 1, U)
      case SparseVector(size, indices, values) =>
        // Sparse case: only touch entries whose row and column indices
        // are both nonzero positions of v.
        val nnz = indices.length
        var colStartIdx = 0
        var prevCol = 0
        var col = 0
        var j = 0
        var i = 0
        var av = 0.0
        while (j < nnz) {
          col = indices(j)
          // Skip empty columns: packed column c holds c + 1 entries, so
          // advancing from prevCol to col moves the offset forward by
          // (col - prevCol) * (col + prevCol + 1) / 2.
          colStartIdx += (col - prevCol) * (col + prevCol + 1) / 2
          av = alpha * values(j)
          i = 0
          while (i <= j) {
            // Adds av * v(indices(i)) to U(indices(i), col) in packed storage.
            U(colStartIdx + indices(i)) += av * values(i)
            i += 1
          }
          j += 1
          prevCol = col
        }
    }
  }

I don't know Scala, which may be why I can't follow it. Could someone explain what is happening here?
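As far as I can tell, the tricky part is the index arithmetic for column-major upper-triangular packed storage. Here is a small sketch I wrote to check it (the object and method names `PackedIndexDemo`, `packedIndex`, `columnStarts` are my own, not from Spark): element (i, j) with i <= j is stored at offset j*(j+1)/2 + i, and the running `colStartIdx` update in `spr` is just a telescoping sum of the packed column lengths.

```scala
object PackedIndexDemo {
  // Element (i, j) with i <= j of an n x n symmetric matrix is stored at
  // offset j*(j+1)/2 + i in column-major upper-triangular packed layout.
  def packedIndex(i: Int, j: Int): Int = j * (j + 1) / 2 + i

  // Reproduces the incremental update in spr: stepping from prevCol to col
  // skips columns prevCol, ..., col - 1, which together hold
  // (col - prevCol) * (col + prevCol + 1) / 2 packed entries, because
  // packed column c holds c + 1 entries.
  def columnStarts(cols: Seq[Int]): Seq[Int] = {
    var colStartIdx = 0
    var prevCol = 0
    cols.map { col =>
      colStartIdx += (col - prevCol) * (col + prevCol + 1) / 2
      prevCol = col
      colStartIdx // start offset of column `col` in the packed array
    }
  }
}
```

For nonzero columns 2, 5, 9 the incremental offsets come out as 3, 15, 45, which match the closed-form `packedIndex(0, col)` for each column.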

It is called from RowMatrix.scala:

  def computeGramianMatrix(): Matrix = {
    val n = numCols().toInt
    checkNumColumns(n)
    // Computes n*(n+1)/2, avoiding overflow in the multiplication.
    // This succeeds when n <= 65535, which is checked above
    val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))

    // Compute the upper triangular part of the gram matrix.
    val GU = rows.treeAggregate(new BDV[Double](nt))(
      seqOp = (U, v) => {
        BLAS.spr(1.0, v, U.data)
        U
      }, combOp = (U1, U2) => U1 += U2)

    RowMatrix.triuToFull(n, GU.data)
  }
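My current understanding is that each `BLAS.spr(1.0, v, U.data)` call folds one rank-1 outer product v * v^T into the packed upper triangle, so the aggregate is the Gramian A^T A of the row matrix. A plain-Scala sketch of the same computation without Spark or packed storage (the object name `GramianSketch` is mine, assuming dense rows):

```scala
object GramianSketch {
  // The Gramian A^T A is the sum over rows v of the outer product v * v^T;
  // treeAggregate accumulates exactly this, one spr call per row, keeping
  // only the upper triangle and expanding it at the end via triuToFull.
  def gramian(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.head.length
    val g = Array.ofDim[Double](n, n)
    for (v <- rows; i <- 0 until n; j <- 0 until n)
      g(i)(j) += v(i) * v(j)
    g
  }
}
```

For rows (1, 2) and (3, 4) this gives G(0)(0) = 1 + 9 = 10, G(0)(1) = 2 + 12 = 14, G(1)(1) = 4 + 16 = 20, and the matrix is symmetric, which is why storing only the `nt = n*(n+1)/2` upper-triangular entries suffices.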

Correlation is defined here: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

The end goal is to understand Spark's correlation algorithm.
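If I understand correctly, the Gramian is the bridge to Pearson correlation: with G = A^T A, column means m, and nRows rows, the sample covariance is cov(i, j) = (G(i)(j) - nRows * m(i) * m(j)) / (nRows - 1), and corr(i, j) = cov(i, j) / sqrt(cov(i, i) * cov(j, j)). A local sketch of that chain (the object name `PearsonFromGramian` is mine, assuming dense rows and nonzero column variances):

```scala
object PearsonFromGramian {
  // Gramian -> covariance -> Pearson correlation, all in plain Scala.
  def corr(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val nRows = rows.length
    val n = rows.head.length
    val m = Array.ofDim[Double](n)            // column means
    val g = Array.ofDim[Double](n, n)         // Gramian A^T A
    for (v <- rows) {
      for (i <- 0 until n) m(i) += v(i) / nRows
      for (i <- 0 until n; j <- 0 until n) g(i)(j) += v(i) * v(j)
    }
    // Sample covariance from the Gramian and the means.
    val cov = Array.tabulate(n, n) { (i, j) =>
      (g(i)(j) - nRows * m(i) * m(j)) / (nRows - 1)
    }
    // Normalize by the per-column standard deviations.
    Array.tabulate(n, n) { (i, j) =>
      cov(i)(j) / math.sqrt(cov(i)(i) * cov(j)(j))
    }
  }
}
```

For rows (1, 2), (2, 4), (3, 6) the second column is exactly twice the first, so the off-diagonal correlation comes out as 1.0.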

Update 1: related paper https://stanford.edu/~rezab/papers/linalg.pdf

0 Answers