Question

I am learning how to use Spark MLlib to compute the product of two matrices. My code currently looks like this:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val rdd1 = sc.textFile("rdd1").map(line => line.split("\t").map(_.toDouble)).zipWithIndex().map { case (v, i) => IndexedRow(i, Vectors.dense(v)) }
val rdd2 = sc.textFile("rdd2").map(line => line.split("\t").map(_.toDouble)).zipWithIndex().map { case (v, i) => IndexedRow(i, Vectors.dense(v)) }
val matrix1 = new IndexedRowMatrix(rdd1)
val matrix2 = new IndexedRowMatrix(rdd2)

I want to multiply matrix1 by matrix2, so I tried this:

matrix1.multiply(matrix2)

But matrix2 has to be a local matrix; it cannot be an IndexedRowMatrix (according to the API docs):

def multiply(B: Matrix): IndexedRowMatrix
Multiply this matrix by a local matrix on the right.
B: a local matrix whose number of rows must match the number of columns of this matrix
returns: an IndexedRowMatrix representing the product, which preserves partitioning

Is there any other way to do this?

Answer 1

There is a way to multiply two IndexedRowMatrix instances using RDDs, but you need to write it yourself. Note that with my implementation you get a DenseMatrix as the result.

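Background

Suppose you have two matrices $A_{m \times n}$ and $B_{n \times p}$ and you want to compute $A_{m \times n} B_{n \times p} = C_{m \times p}$ (usually $n \gg m$ and $n \gg p$, otherwise you would not be using IndexedRowMatrix).

$A_{(i)}$, of size $m \times 1$, is the $i$-th column vector of $A_{m \times n}$, and it is stored as one row of an IndexedRowMatrix. Similarly, $B_{(i)}$, of size $1 \times p$, is the $i$-th row vector of $B$, stored in the corresponding row of its IndexedRowMatrix. In other words, the first IndexedRowMatrix holds $A^{T}$ ($n$ rows of length $m$) and the second holds $B$ ($n$ rows of length $p$).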

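It is also not hard to show that $C = \sum_{i=1}^{n} C_{(i)}$, where $C_{(i)} = A_{(i)} B_{(i)}$ is the $m \times p$ product of the $m \times 1$ column $A_{(i)}$ and the $1 \times p$ row $B_{(i)}$ (equivalently, the transpose of the $i$-th stored row of the first IndexedRowMatrix times the $i$-th row of the second).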
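For completeness, the identity follows entry by entry from the definition of the matrix product. Here $A_{(i)}$ is the $i$-th column of $A$ and $B_{(i)}$ the $i$-th row of $B$; the index names $j$ and $k$ are introduced only for this derivation.

```latex
C_{jk} \;=\; \sum_{i=1}^{n} A_{ji}\, B_{ik}
       \;=\; \sum_{i=1}^{n} \bigl( A_{(i)} B_{(i)} \bigr)_{jk},
\qquad 1 \le j \le m,\quad 1 \le k \le p,
```

since the $j$-th entry of $A_{(i)}$ is $A_{ji}$ and the $k$-th entry of $B_{(i)}$ is $B_{ik}$.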

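When nxp is large, the two operations above (forming each partial product and summing them) can easily be implemented as a map followed by a reduce, or more efficiently with .treeAggregate.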
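To make the map + reduce idea concrete before bringing Spark into it, here is a minimal local sketch on plain Scala arrays. The names OuterProductSum, outer, add and multiply are hypothetical and chosen only for this illustration; the input layout mirrors the assumption above that the first argument holds the transpose of A.

```scala
// Local, Spark-free sketch: C = sum over i of (column i of A) * (row i of B).
// All names here are illustrative and not part of any library.
object OuterProductSum {
  type Mat = Array[Array[Double]] // row-major: mat(r)(c)

  // Outer product of an m-vector and a p-vector, giving an m x p matrix.
  def outer(u: Array[Double], v: Array[Double]): Mat =
    u.map(ui => v.map(vj => ui * vj))

  // Element-wise sum of two matrices of the same shape.
  def add(x: Mat, y: Mat): Mat =
    x.zip(y).map { case (xr, yr) => xr.zip(yr).map { case (s, t) => s + t } }

  // aT holds A transposed: row i of aT is column i of A (length m).
  // b holds B: row i of b has length p. Both must have the same n >= 1 rows.
  def multiply(aT: Mat, b: Mat): Mat =
    aT.zip(b).map { case (ai, bi) => outer(ai, bi) }.reduce(add)
}
```

The `map` step builds one partial product per row pair and the `reduce` step sums them, which is the same shape as the distributed versions in this answer, where the pairing is done with a join on the row index.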

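Version 1: using Breeze

A simple implementation that uses Breeze matrices for the multiplication, assuming your matrices are dense (if they are not, you can make some further optimizations).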

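import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

def distributedMul(a: IndexedRowMatrix, b: IndexedRowMatrix, m: Int, p: Int): Matrix = {
  // Key each row by its index so that matching rows of a and b can be joined.
  val aRows = a.rows.map((iV) => (iV.index, iV.vector))
  val bRows = b.rows.map((iV) => (iV.index, iV.vector))
  val joint = aRows.join(bRows)
  // Row i of a as an m x 1 column times row i of b as a 1 x p row.
  def vectorMul(e: (Long, (Vector, Vector))): BDM[Double] = {
    val v1 = BDM.create[Double](m, 1, e._2._1.toArray)
    val v2 = BDM.create[Double](1, p, e._2._2.toArray)
    v1 * v2  // This is C(i)
  }
  // Sum the partial products; Breeze and Matrices.dense both use column-major storage.
  Matrices.dense(m, p, joint.map(vectorMul).reduce(_ + _).toArray)
}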

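Remarks

Calling numRows() and numCols() on an IndexedRowMatrix can be expensive. If you already know the dimensions, you can pass them in directly as parameters.

You can use cartesian instead of join, but then you need to add an if that returns a zero matrix when the indices differ.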

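Version 2: using BLAS

This version is more efficient than the previous one (there is yet another version that uses only Scala arrays, but it is extremely inefficient). You need to put it inside an object, because BLAS is not serializable.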

import com.github.fommil.netlib.BLAS
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

object SuperMul extends Serializable {
  val blas = BLAS.getInstance()

  def distributedMul(a: IndexedRowMatrix, b: IndexedRowMatrix, m: Int, p: Int): Matrix = {
    val aRows = a.rows.map((iV) => (iV.index, iV.vector))
    val bRows = b.rows.map((iV) => (iV.index, iV.vector))
    val joint = aRows.join(bRows)
    val dim = m * p

    def summul(u: Array[Double], e: (Long, (Vector, Vector))): Array[Double] = {
      // u = a'(i)*b(i) + u
      blas.dgemm("N", "T", m, p, 1, 1.0, e._2._1.toArray, m, e._2._2.toArray, p, 1.0, u, m)
      u
    }

    def sum(u: Array[Double], v: Array[Double]): Array[Double] = {
      // v = u + v
      blas.daxpy(dim, 1.0, u, 1, v, 1)
      v
    }

    Matrices.dense(m, p, joint.treeAggregate(Array.fill[Double](dim)(0))(summul, sum))
  }
}
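As a reading aid for the dgemm call in summul: with the inner dimension equal to 1, it amounts to adding one outer product into the accumulator u, which holds an m x p matrix in column-major order. The sketch below writes that update as plain loops. Rank1Update and addOuter are hypothetical names used only here, and this loop version is for illustration; it is not a substitute for the BLAS call.

```scala
// Plain-Scala illustration of the rank-1 update u = a * b^T + u.
// Names are illustrative and not part of any library.
object Rank1Update {
  // u is an m x p matrix stored column-major:
  // entry (i, j) lives at index j * m + i.
  def addOuter(u: Array[Double], a: Array[Double], b: Array[Double]): Array[Double] = {
    val m = a.length
    val p = b.length
    require(u.length == m * p, "u must hold m * p entries")
    for (j <- 0 until p; i <- 0 until m)
      u(j * m + i) += a(i) * b(j)
    u // updated in place and returned, like summul
  }
}
```

Because the accumulator is mutated in place and returned, the same array can be threaded through every row of a partition, which is what makes this pattern a good fit for treeAggregate.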

Answer 2

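You can compute a local matrix and multiply by it before you create the second IndexedRowMatrix.

val dArray = sc.textFile("rdd2").map(line=>line.split("\t").map(_.toDouble)) gives you the Double values you need, as an RDD of row arrays.

You can then use Matrices.dense(rows, columns, dArray) and multiply it with the first matrix. Note that Matrices.dense takes a flat Array[Double] in column-major order, so the rows have to be collected and flattened first.

After that, you can go on to create the IndexedRowMatrix for the second matrix.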
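One detail worth spelling out is the layout of the values array: Matrices.dense(numRows, numCols, values) reads values in column-major order, while the text file is parsed row by row. A minimal sketch of the conversion follows, assuming the second matrix is small enough to be collected to the driver (for example with collect() on the RDD of rows). LocalMatrixLayout and toColumnMajor are hypothetical helper names, not part of Spark.

```scala
// Illustrative helper, not part of Spark: flatten row-major rows into the
// column-major array that Matrices.dense(numRows, numCols, values) expects.
object LocalMatrixLayout {
  def toColumnMajor(rows: Array[Array[Double]]): Array[Double] = {
    val numRows = rows.length
    val numCols = if (numRows == 0) 0 else rows(0).length
    require(rows.forall(_.length == numCols), "all rows must have the same length")
    val values = new Array[Double](numRows * numCols)
    // Entry (r, c) goes to index c * numRows + r in column-major storage.
    for (r <- 0 until numRows; c <- 0 until numCols)
      values(c * numRows + r) = rows(r)(c)
    values
  }
}
```

With the rows collected into an Array[Array[Double]], the result of toColumnMajor could be passed as the third argument of Matrices.dense, and the resulting local matrix handed to matrix1.multiply.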
