How can I multiply two distributed MLlib matrices and get the result back on a standalone Spark cluster in Scala, with 9 worker machines and 1 driver machine? There are 27 workers in total, i.e. 3 workers per worker machine, and each worker has 2 cores. The multiplication should be done partition by partition, i.e. the first partition of matrix A with the first partition of matrix B, and so on. I am planning on 27 partitions.
The result of the matrix product should also be received partition-wise. And how can I keep the same number of records in each partition? Matrix A is small, but matrix B is large and does not fit in the memory of a single machine. The goal is to apply further transformations to the partition-wise products of Mat A and Mat B.
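For reference, this is roughly how I set up the context for that layout (a minimal sketch only; the app name and master URL below are placeholders, and in spark-shell the same values can be passed with --conf):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("BlockMatrixPartitionProduct") //placeholder app name
  .setMaster("spark://master-host:7077") //placeholder standalone master URL
  .set("spark.cores.max", "54") //27 workers x 2 cores each
  .set("spark.default.parallelism", "27") //one partition per worker
val sc = new SparkContext(conf)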
Let me clarify this with the following code, which creates two block matrices.
//imports needed for the example below (sc is the SparkContext)
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

//creation of blocks as local matrices which are components of first block matrix
val eye1 = Matrices.dense(3, 2, Array(1, 2, 3, 4, 5, 6))
val eye2 = Matrices.dense(3, 2, Array(4, 5, 6, 7, 8, 9))
val eye3 = Matrices.dense(3, 2, Array(7, 8, 9, 1, 2, 3))
val eye4 = Matrices.dense(3, 2, Array(4, 5, 6, 1, 2, 3))
val blocks = sc.parallelize(Seq(
((0, 0), eye1), ((1, 1), eye2), ((2, 2), eye3), ((3, 3), eye4)),4)
//block matrix created with 3 rows per block and 2 columns per block
val blockMatrix = new BlockMatrix(blocks, 3, 2)
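//(optional sanity check, not part of my original flow) validate() should pass
//and the first block matrix should be 12 x 8: four diagonal blocks of 3 x 2 each
blockMatrix.validate()
println(s"blockMatrix: ${blockMatrix.numRows()} x ${blockMatrix.numCols()}")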
//creation of blocks as local matrices which are components of second block matrix
val eye5 = Matrices.dense(2, 4, Array(1, 2, 3, 4, 5, 6, 7, 8))
val eye6 = Matrices.dense(2, 4, Array(2, 4, 6, 8, 10, 12, 14, 16))
val eye7 = Matrices.dense(2, 4, Array(3, 6, 9, 12, 15, 18, 21, 24))
val eye8 = Matrices.dense(2, 4, Array(4, 8, 12, 16, 20, 24, 28, 32))
val blocks1 = sc.parallelize(Seq(
((0, 0), eye5), ((1, 1), eye6), ((2, 2), eye7), ((3, 3), eye8)), 4)
//block matrix created with 2 rows per block and 4 columns per block
val blockMatrix1 = new BlockMatrix(blocks1, 2, 4)
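//(optional sanity check) the second block matrix should be 8 x 16, so its 8 rows
//line up with the 8 columns of blockMatrix for the multiplication
blockMatrix1.validate()
println(s"blockMatrix1: ${blockMatrix1.numRows()} x ${blockMatrix1.numCols()}")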
//The following line multiplies the block matrices
val blockProduct = blockMatrix.multiply(blockMatrix1)
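//as far as I understand, multiply() builds the product with its own GridPartitioner,
//so blockProduct.blocks need not keep the 4-partition layout of the inputs
println(s"blockProduct: ${blockProduct.numRows()} x ${blockProduct.numCols()} " +
  s"in ${blockProduct.blocks.partitions.length} partitions")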
//the indices of block matrix are converted to RDD
var blockMatrixIndex = blockProduct.blocks.map{
case((a,b),m) => (a,b)}
var (blockRowIndexMaxValue, blockColIndexMaxValue) = blockMatrixIndex.max()
//the data of block of blockmatrix is converted to RDD
var blockMatrixRDD = blockProduct.blocks.map{
case((a,b),m) => m}
//elements of block matrix are doubled
var blockMatrixRDDElementDoubled = blockMatrixRDD.map(x => x.toArray.map(y => 2*y))
//code for finding number of rows of individual block in the block matrix
var blockMatRowCount = blockMatrixRDD.map(x => x.numRows).first
//code for finding number of columns of individual block in the block matrix
var blockMatColCount = blockMatrixRDD.map(x => x.numCols).first
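//taking the size of the first block assumes all blocks of the product have the
//same shape; a quick check of that assumption (should print a single pair, (3,4) here)
println(blockMatrixRDD.map(m => (m.numRows, m.numCols)).distinct().collect().mkString(", "))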
//data block of block matrix is recreated
var blockMatrixBlockRecreated = blockMatrixRDDElementDoubled.map(x => Matrices.dense(blockMatRowCount, blockMatColCount, x))
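//note: Matrix.toArray and Matrices.dense both use column-major order, so the
//doubled values should land back in their original positions; spot check of one block
println(blockMatrixBlockRecreated.first())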
//code for generating index sequence for blocks of blockmatrix
val indexRange = List.range(0, blockRowIndexMaxValue + 1)
var indexSeq = indexRange zip indexRange
//partitioning index sequence into 4 partitions
var indexSeqRDD = sc.parallelize(indexSeq, blockRowIndexMaxValue + 1)
//code for regenerating block matrix in RDD form
var completeBlockMatrixRecreated = indexSeqRDD.zip(blockMatrixBlockRecreated)
completeBlockMatrixRecreated has the type org.apache.spark.rdd.RDD[((Int, Int), org.apache.spark.mllib.linalg.Matrix)], so it should contain the 4 blocks.
But if I try to execute
completeBlockMatrixRecreated.take(2)
it fails with the error "org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition".
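A quick way to see where the mismatch comes from is to compare the per-partition element counts of the two RDDs that go into the zip (a small diagnostic sketch, using only the variables defined above):
val leftCounts = indexSeqRDD.mapPartitions(it => Iterator(it.size)).collect()
val rightCounts = blockMatrixBlockRecreated.mapPartitions(it => Iterator(it.size)).collect()
println("indexSeqRDD elements per partition: " + leftCounts.mkString(", "))
println("blockMatrixBlockRecreated elements per partition: " + rightCounts.mkString(", "))
My guess is that the counts differ because multiply() lays the product's blocks out with its own grid partitioner, so some partitions end up with two blocks and some with none, while indexSeqRDD has exactly one element in each of its 4 partitions.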