Spark PCA OutOfMemory error on a small number of columns and rows

Date: 2015-04-16 19:36:47

Tags: scala apache-spark out-of-memory pca apache-spark-mllib

I'm trying to run Spark MLlib PCA (in Scala) on a RowMatrix with 2168 columns and a large number of rows. However, I've observed that even with only 2 rows in the matrix (a 112KB text file), the same job step produces the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space 
        at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:92) 
        at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39) 
        at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38) 
        at breeze.generic.UFunc$class.apply(UFunc.scala:48) 
        at breeze.linalg.svd$.apply(svd.scala:22) 
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:380) 
        at SimpleApp$.main(scala-pca.scala:17) 
        at SimpleApp.main(scala-pca.scala) 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:601) 
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) 
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) 
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) 
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) 
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I've also observed that the error goes away when I use 1100 columns or fewer, regardless of the number of rows in the RowMatrix.

I'm running Spark 1.3.0 in standalone mode on 21 nodes, each with 12 worker instances and 20G of memory. I submit the job via spark-submit with --driver-memory 6g and --conf spark.executor.memory=1700m. The following options are set in spark-env.sh:

SPARK_WORKER_MEMORY=1700M
SPARK_WORKER_CORES=1
SPARK_WORKER_INSTANCES=12
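
Putting it together, the full submit command looks roughly like this (a sketch; the application jar name and standalone master URL are placeholders, not the exact values from my cluster):

spark-submit \
  --class SimpleApp \
  --master spark://master-host:7077 \
  --driver-memory 6g \
  --conf spark.executor.memory=1700m \
  scala-pca.jar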

Here is the code I'm submitting:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SimpleApp {
  def main(args: Array[String]) {
    val datafilePattern = "/path/to/data/files*.txt"
    val conf = new SparkConf().setAppName("pca_analysis").setMaster("master-host")
    val sc = new SparkContext(conf)
    val lData = sc.textFile(datafilePattern).cache()

    val vecData = lData.map(line => line.split(" ").map(v => v.toDouble)).map(arr => Vectors.dense(arr))
    val rmat: RowMatrix = new RowMatrix(vecData)
    val pc: Matrix = rmat.computePrincipalComponents(15)
    val projected: RowMatrix = rmat.multiply(pc)

    println("Finished projecting rows.")
  }
}

Has anyone else run into this problem with the computePrincipalComponents() method? Any help is greatly appreciated.

1 Answer:

Answer 0 (score: 0)

I just ran into this problem myself, and the way to resolve it was to increase --driver-memory to 2G or more, as needed.
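
For context, the stack trace above shows breeze.linalg.svd running in the driver's main thread: computePrincipalComponents collects the column covariance matrix to the driver and factorizes it there, so it is the driver heap, not spark.executor.memory, that has to grow with the column count. Besides the --driver-memory flag, a sketch of the equivalent setting in conf/spark-defaults.conf (note that setting spark.driver.memory on the SparkConf inside the application is too late, because the driver JVM has already started):

spark.driver.memory    2g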