Class: https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix
Are operations on a Matrix distributed the same way RDD operations are? Reading the documentation, it seems they are not (distribution is never mentioned).
So, if I run:
package worksheets

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._

object matrix {
  println("Welcome to the Scala worksheet") //> Welcome to the Scala worksheet

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("filter")
    .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
    .set("spark.executor.memory", "512m")   //> conf : org.apache.spark.SparkConf = org.apache.spark.SparkConf@1faf8f2

  val sc = new SparkContext(conf)           //> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
                                            //| 15/03/30 13:33:33 INFO SecurityManager: Changing view acls to: user
                                            //| 15/03/30 13:33:33 INFO SecurityManager: Changing modify acls to: user
                                            //| 15/03/30 13:33:33 INFO SecurityManager: SecurityManager: authentication disabled;
                                            //|   ui acls disabled; users with view permissions: Set(user); user
                                            //| Output exceeds cutoff limit.

  // Two local, driver-side matrices (values in column-major order); these are not RDDs.
  val dm: DenseMatrix = new DenseMatrix(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
                                            //> dm : org.apache.spark.mllib.linalg.DenseMatrix = 1.0 2.0
                                            //|                                                  3.0 4.0
                                            //|                                                  5.0 6.0
  val md: DenseMatrix = new DenseMatrix(2, 3, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
                                            //> md : org.apache.spark.mllib.linalg.DenseMatrix = 1.0 5.0 4.0
                                            //|                                                  3.0 2.0 6.0

  // Local matrix-matrix multiply, executed on the driver through BLAS.
  dm.multiply(md)                           //> 15/03/30 13:33:42 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
                                            //| 15/03/30 13:33:42 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
                                            //| res0: org.apache.spark.mllib.linalg.DenseMatrix = 7.0  9.0  16.0
                                            //|                                                   15.0 23.0 36.0
                                            //|                                                   23.0 37.0 56.0
}
So Spark does not seem to distribute this operation?
This Jira suggests it may be supported in the future: https://issues.apache.org/jira/browse/SPARK-3434. If the matrices were stored as distributed block matrices, would operations on them be distributed?
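(SPARK-3434 has since landed as BlockMatrix in org.apache.spark.mllib.linalg.distributed in Spark 1.3, and its multiply does run as a distributed job over the matrix blocks. A minimal sketch of what that looks like, assuming Spark 1.3+, the same example values as above, and the SparkContext sc from the worksheet:)

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build the two example matrices as distributed matrices from RDDs of entries.
val aEntries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 0, 3.0), MatrixEntry(2, 0, 5.0),
  MatrixEntry(0, 1, 2.0), MatrixEntry(1, 1, 4.0), MatrixEntry(2, 1, 6.0)))
val bEntries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 5.0), MatrixEntry(0, 2, 4.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 2.0), MatrixEntry(1, 2, 6.0)))

val a = new CoordinateMatrix(aEntries).toBlockMatrix().cache()
val b = new CoordinateMatrix(bEntries).toBlockMatrix().cache()

// Unlike DenseMatrix.multiply, this runs as a distributed Spark job over the blocks.
val c = a.multiply(b)
println(c.toLocalMatrix()) // same 3x3 result as the local multiply above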
Answer 0 (score: 1):
This is worth a look:
Hi Liquan,
There is some work being done on implementing linear algebra algorithms
on Spark for use in higher-level machine learning algorithms. That work is
happening in the MLlib project, which has an
org.apache.spark.mllib.linalg package you may find useful.
See
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg
From my quick look (I've never read this code before and am not familiar with
MLlib), both IndexedRowMatrix and RowMatrix implement a multiply
operation:
aash@aash-mbp ~/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg$ git grep 'def multiply'
distributed/IndexedRowMatrix.scala:  def multiply(B: Matrix): IndexedRowMatrix = {
distributed/RowMatrix.scala:  def multiply(B: Matrix): RowMatrix = {
aash@aash-mbp ~/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg$
Can you look into using that code and let us know if it meets your needs?
Thanks!
Andrew
On Sat, May 17, 2014 at 10:28 PM, Liquan Pei <[hidden email]> wrote:
> Hi
>
> I am currently implementing an algorithm involving matrix multiplication.
> Basically, I have matrices represented as RDD[Array[Double]]. For example,
> if I have A: RDD[Array[Double]] and B: RDD[Array[Double]], what would be
> the most efficient way to compute C = A * B?
>
> Both A and B are large, so it would not be possible to save either of them
> in memory.
>
> Thanks a lot for your help!
>
> Liquan
>
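Following up on Andrew's pointer: RowMatrix.multiply takes a local Matrix as its right-hand operand, so it covers the distributed-rows-times-small-local-matrix case rather than two large distributed operands. A minimal sketch, assuming an existing SparkContext sc:

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A distributed matrix: each row is a Vector stored in an RDD partition.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val a = new RowMatrix(rows)

// The right-hand operand is a small local matrix (values in column-major order).
val b = Matrices.dense(2, 3, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// multiply maps over the row RDD, so this work is distributed across executors.
val c: RowMatrix = a.multiply(b)
c.rows.collect().foreach(println)

For Liquan's case, where both A and B are too large to hold in memory, neither RowMatrix.multiply nor IndexedRowMatrix.multiply applies; that is exactly the gap the BlockMatrix work in SPARK-3434 fills.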