Suppose I have a RowMatrix.
I have converted the RowMatrix to a DenseMatrix as follows:
DenseMatrix Mat = new DenseMatrix(m,n,MatArr);
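This requires converting the RowMatrix to a JavaRDD and then converting the JavaRDD to an array.
Is there a more convenient way to do the conversion?
Thanks in advance.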
Answer 0 (score: 17)
In case anyone is interested, I have implemented the distributed version that @javadba proposed.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val transposedRowsRDD = m.rows.zipWithIndex.map{case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex)}
    .flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
    .groupByKey
    .sortByKey().map(_._2) // sort rows and remove row indexes
    .map(buildRow) // restore order of elements in each row and remove column indexes
  new RowMatrix(transposedRowsRDD)
}

def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
  val indexedRow = row.toArray.zipWithIndex
  indexedRow.map{case (value, colIndex) => (colIndex.toLong, (rowIndex, value))}
}

def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
  val resArr = new Array[Double](rowWithIndexes.size)
  rowWithIndexes.foreach{case (index, value) =>
    resArr(index.toInt) = value
  }
  Vectors.dense(resArr)
}
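To see what the intermediate triplets do without starting Spark, here is a minimal sketch of the same index-swapping idea on plain Scala collections. The object and method names (TripletTransposeSketch, transposeLocal) are made up for illustration and are not part of any library; it assumes every row has the same length.

```scala
object TripletTransposeSketch {
  // Local, non-distributed illustration: emit (newRowIndex, (newColIndex, value)),
  // group by the new row index, then rebuild each row in column order.
  def transposeLocal(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val numRows = rows.length
    rows.zipWithIndex
      .flatMap { case (row, rowIndex) =>
        // the old column index becomes the new row index
        row.toSeq.zipWithIndex.map { case (value, colIndex) => (colIndex, (rowIndex, value)) }
      }
      .groupBy(_._1)   // one group per new row
      .toSeq
      .sortBy(_._1)    // put the new rows in order
      .map { case (_, entries) =>
        val out = new Array[Double](numRows)
        // the old row index is the position inside the new row
        entries.foreach { case (_, (rowIndex, value)) => out(rowIndex) = value }
        out
      }
  }
}
```

The Spark version does the same thing, except that groupByKey and sortByKey run as shuffles across the cluster instead of on an in-memory Seq.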
Answer 1 (score: 6)
You can use a BlockMatrix, which can be created from an IndexedRowMatrix:
BlockMatrix matA = new IndexedRowMatrix(...).toBlockMatrix().cache();
matA.validate();
BlockMatrix matB = matA.transpose();
The result can then easily be converted back to an IndexedRowMatrix. This is described in the Spark documentation.
Answer 2 (score: 4)
You are right: there is no RowMatrix.transpose() method. You have to do this manually.
Here is the non-distributed/local matrix version:
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  (for {
    c <- m(0).indices
  } yield m.map(_(c))).toArray
}
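The distributed version would go along the following lines:
import org.apache.spark.mllib.linalg.Vectors
val numRows = origMatRdd.numRows().toInt
val transposedRows = origMatRdd.rows.zipWithIndex
  .flatMap { case (rvect, i) =>
    // emit (newRowIndex, (newColIndex, value)) for every entry
    rvect.toArray.zipWithIndex.map { case (ax, j) => (j.toLong, (i, ax)) }
  }
  .groupByKey()
  .sortByKey() // order the transposed rows by their index
  .map { case (_, entries) =>
    val arr = new Array[Double](numRows)
    entries.foreach { case (ix, ax) => arr(ix.toInt) = ax }
    Vectors.dense(arr) // wrap the resulting RDD in new RowMatrix(...) if needed
  }
Caveat: I have not tested the above, so expect bugs. The basic approach is sound, though, and is similar to what I did in the past for a small linear algebra library for Spark.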
Answer 3 (score: 2)
For very large and sparse matrices (such as those produced by text feature extraction), the best and simplest way is:
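import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  // attach an explicit index to every row
  val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
    case (row, idx) => new IndexedRow(idx, row)}))
  val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
  // sort by index before dropping it, so the row order is preserved
  new RowMatrix(transposed.rows
    .map(idxRow => (idxRow.index, idxRow.vector))
    .sortByKey().map(_._2))
}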
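For matrices that are not so sparse, you can use BlockMatrix as the bridge, as mentioned in aletapool's answer above.
However, aletapool's answer misses a very important point: when you go RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, you have to sort in the last step (IndexedRowMatrix -> RowMatrix). By default, converting an IndexedRowMatrix to a RowMatrix simply drops the indices, and the row order gets scrambled.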
import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}

val data = Array(
  MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))
val dataRDD = sc.parallelize(data, 4)
val testMat: RowMatrix = new RowMatrix(dataRDD)
testMat.rows.collect().map(_.toDense).foreach(println)
[0.0,1.0,0.0,7.0,0.0]
[2.0,0.0,3.0,4.0,5.0]
[4.0,0.0,0.0,6.0,7.0]
[0.0,0.0,2.0,7.0,0.0]
transposeRowMatrix(testMat).
rows.collect().map(_.toDense).foreach(println)
[0.0,2.0,4.0,0.0]
[1.0,0.0,0.0,0.0]
[0.0,3.0,0.0,2.0]
[7.0,4.0,6.0,7.0]
[0.0,5.0,7.0,0.0]
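The BlockMatrix route with the final sort could look roughly like this. It is a sketch, not something I have benchmarked: the object and function names (BlockTransposeSketch, transposeViaBlockMatrix) are made up for illustration, and toBlockMatrix() is called with its default block size.

```scala
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

object BlockTransposeSketch {
  // Hypothetical helper: RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose
  // -> IndexedRowMatrix -> sorted RowMatrix.
  def transposeViaBlockMatrix(m: RowMatrix): RowMatrix = {
    // Attach an explicit index to every row so the order can be restored later.
    val indexed = new IndexedRowMatrix(
      m.rows.zipWithIndex.map { case (row, idx) => IndexedRow(idx, row) })

    // Transpose in block form, then go back to indexed rows.
    val transposed = indexed.toBlockMatrix().transpose.toIndexedRowMatrix()

    // Sort by index before dropping it; this is the step that is easy to miss.
    new RowMatrix(
      transposed.rows
        .map(r => (r.index, r.vector))
        .sortByKey()
        .map(_._2))
  }
}
```

Calling BlockTransposeSketch.transposeViaBlockMatrix(testMat) on the test matrix above should give the same rows as the CoordinateMatrix version, assuming the matrix is small enough for the default block size to be reasonable.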
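Answer 4 (score: 0)
This is a variant of the previous solution, but it works for sparse row matrices and keeps the transposed matrix sparse where needed: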
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def transpose(X: RowMatrix): RowMatrix = {
  val m = X.numRows().toInt
  val transposed = X.rows.zipWithIndex.flatMap {
    // emit (rowIndex, colIndex, value) for every stored entry
    case (sp: SparseVector, i: Long) => sp.indices.zip(sp.values).map { case (j, value) => (i, j, value) }
    case (dp: DenseVector, i: Long) => dp.values.zipWithIndex.map { case (value, j) => (i, j, value) }
  }.groupBy(t => t._2).map { case (j, g) =>
    // groupBy does not guarantee the order inside a group, so sort by original row index
    val (indices, values) = g.toSeq.sortBy(_._1).map { case (i, _, value) => (i.toInt, value) }.unzip
    if (indices.size == m) {
      (j, Vectors.dense(values.toArray))
    } else {
      (j, Vectors.sparse(m, indices.toArray, values.toArray))
    }
  }.sortBy(t => t._1).map(t => t._2)
  // note: a column with no stored entries produces no row in the result
  new RowMatrix(transposed)
}
Hope this helps!