Apache Spark distributed matrix operations?

Asked: 2015-03-30 12:42:24

Tags: scala apache-spark

Class: https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix

Are Matrix operations distributed in the same way that RDD operations are? From reading the documentation it does not appear so (it is never mentioned).

So, if I run:

package worksheets

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

object matrix {
  println("Welcome to the Scala worksheet")       //> Welcome to the Scala worksheet

  val conf = new org.apache.spark.SparkConf()
    .setMaster("local")
    .setAppName("filter")
    .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
    .set("spark.executor.memory", "512m");        //> conf  : org.apache.spark.SparkConf = org.apache.spark.SparkConf@1faf8f2

  val sc = new org.apache.spark.SparkContext(conf)//> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.propert
                                                  //| ies
                                                  //| 15/03/30 13:33:33 INFO SecurityManager: Changing view acls to: user
                                                  //| 15/03/30 13:33:33 INFO SecurityManager: Changing modify acls to: user
                                                  //| 15/03/30 13:33:33 INFO SecurityManager: SecurityManager: authentication disa
                                                  //| bled; ui acls disabled; users with view permissions: Set(user); user
                                                  //| Output exceeds cutoff limit.

  // build two local DenseMatrix values (the data arrays are column-major)
  // and multiply them.
  val dm: DenseMatrix = new DenseMatrix(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
                                                  //> dm  : org.apache.spark.mllib.linalg.DenseMatrix = 1.0  2.0  
                                                  //| 3.0  4.0  
                                                  //| 5.0  6.0  

  val md: DenseMatrix = new DenseMatrix(2, 3, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
                                                  //> md  : org.apache.spark.mllib.linalg.DenseMatrix = 1.0  5.0  4.0  
                                                  //| 3.0  2.0  6.0  

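  // Note: DenseMatrix.multiply is a local BLAS gemm on the driver; the BLAS
  // warnings below come from netlib failing to load a native implementation
  // and falling back to its pure-Java one.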
  dm.multiply(md)                                 //> 15/03/30 13:33:42 WARN BLAS: Failed to load implementation from: com.github
                                                  //| .fommil.netlib.NativeSystemBLAS
                                                  //| 15/03/30 13:33:42 WARN BLAS: Failed to load implementation from: com.github
                                                  //| .fommil.netlib.NativeRefBLAS
                                                  //| res0: org.apache.spark.mllib.linalg.DenseMatrix = 7.0   9.0   16.0  
                                                  //| 15.0  23.0  36.0  
                                                  //| 23.0  37.0  56.0  
}

It seems that Spark does not distribute this operation?

This JIRA seems to hint that it may be supported in the future: https://issues.apache.org/jira/browse/SPARK-3434. If the matrices were stored as distributed block matrices, would operations on them then be distributed?
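For reference, a minimal sketch of what that would look like, assuming Spark 1.3+ (where the BlockMatrix type from SPARK-3434 has landed), so it will not run against the 1.1.0/1.2.1 builds used above:

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Assumes the SparkContext `sc` from the worksheet above.
// Build the two matrices as RDDs of (row, col, value) entries; in a real job
// these RDDs would be large and partitioned across the cluster.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0),
  MatrixEntry(2, 0, 5.0), MatrixEntry(2, 1, 6.0)))
val entriesB = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 5.0), MatrixEntry(0, 2, 4.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 2.0), MatrixEntry(1, 2, 6.0)))

// toBlockMatrix() splits each matrix into local sub-blocks stored in an RDD.
val A: BlockMatrix = new CoordinateMatrix(entriesA, 3, 2).toBlockMatrix().cache()
val B: BlockMatrix = new CoordinateMatrix(entriesB, 2, 3).toBlockMatrix().cache()

// multiply runs as a distributed job over the blocks.
val C: BlockMatrix = A.multiply(B)
println(C.toLocalMatrix())   // 3 x 3 result, same values as the local example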

1 Answer:

Answer 0 (score: 1):

This is worth a look:

http://apache-spark-developers-list.1001551.n3.nabble.com/Matrix-Multiplication-of-two-RDD-Array-Double-s-td6656.html

Hi Liquan, 

There is some work being done on implementing linear algebra algorithms 
on Spark for use in higher-level machine learning algorithms.  That work is 
happening in the MLlib project, which has an 
org.apache.spark.mllib.linalg package you may find useful. 

See 
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg

From my quick look (never read this code before and not familiar with 
MLlib) both the IndexedRowMatrix and RowMatrix implement a multiply 
operation: 

aash@aash-mbp~/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg$ git grep 'def multiply'
distributed/IndexedRowMatrix.scala:  def multiply(B: Matrix): IndexedRowMatrix = {
distributed/RowMatrix.scala:  def multiply(B: Matrix): RowMatrix = {
aash@aash-mbp~/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg$

Can you look into using that code and let us know if it meets your needs? 

Thanks! 
Andrew 


On Sat, May 17, 2014 at 10:28 PM, Liquan Pei <[hidden email]> wrote: 

> Hi 
> 
> I am currently implementing an algorithm involving matrix multiplication. 
> Basically, I have matrices represented as RDD[Array[Double]]. For example, 
> If I have A:RDD[Array[Double]] and B:RDD[Array[Double]] and what would be 
> the most efficient way to get C = A * B 
> 
> Both A and B are large, so it would not be possible to save either of them 
> in memory. 
> 
> Thanks a lot for your help! 
> 
> Liquan 
> 
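To make that concrete, here is a minimal sketch of the distributed path the thread points at: a RowMatrix (rows stored as an RDD of vectors) multiplied by a local Matrix. Note that RowMatrix.multiply only accepts a local matrix as the right-hand operand; it does not multiply two distributed matrices.

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumes the SparkContext `sc` from the question. The rows of the left-hand
// matrix live in an RDD, so the multiplication is executed as a distributed
// map over the row partitions.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val distributed = new RowMatrix(rows)                          // 3 x 2, distributed

// The right-hand operand must be a *local* matrix (data is column-major).
val local = Matrices.dense(2, 3, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

val product: RowMatrix = distributed.multiply(local)           // 3 x 3, still distributed
product.rows.collect().foreach(println)

For multiplying two distributed matrices against each other, BlockMatrix (the SPARK-3434 work mentioned in the question, available from Spark 1.3) is the type that supports it.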