I am trying to check some work in Spark by filling a matrix and then using the computeSVD method of IndexedRowMatrix on it. This is the code I am using:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import scala.util.Random
val nCol = 2000
val nRow = 10000
val numbers: Seq[Int] = (0 until nRow).toSeq
val numbersRDD: RDD[Int] = sc.parallelize(numbers)
val indexedRowRDD = numbersRDD.
  map(number => new Random(number)).
  map(random => Array.fill(nCol){random.nextDouble()}).
  map(values => new IndexedRow(1, Vectors.dense(values))).
  cache
When I execute it on Spark, it throws the following exception in the map that fills the array with random doubles:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
... 52 elided
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@62264d4f)
- field (class: $iw, name: sparkContext, type: class org.apache.spark.SparkContext)
- object (class $iw, $iw@ab5d5c8)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@99ace98)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@1a99d328)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@140e003e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@43f4621b)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@1ac9c3cc)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5078e308)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@45382749)
- field (class: $line19.$read, name: $iw, type: class $iw)
- object (class $line19.$read, $line19.$read@3695e4f2)
- field (class: $iw, name: $line19$read, type: class $line19.$read)
- object (class $iw, $iw@12418d3f)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@71077d1)
- field (class: $anonfun$2, name: $outer, type: class $iw)
- object (class $anonfun$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 61 more
I don't understand why the SparkContext is being serialized in the first place, and hence why Spark complains that the SparkContext is not serializable.
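For reference, here is a minimal sketch of the kind of rewrite I would expect to sidestep the error, assuming the cause is that the closure drags in the REPL line wrapper (the $iw objects in the serialization stack, one of which holds the SparkContext). It reuses the imports, numbersRDD and nCol from the snippet above; the local val cols is just an illustrative name, and this is an assumption on my part rather than a confirmed fix:

// Copy nCol into a local val so the lambda only captures an Int,
// not the REPL wrapper object that defines nCol.
val indexedRowRDD = {
  val cols = nCol
  numbersRDD.map { number =>
    val random = new Random(number)
    // The original uses a constant row index of 1; using the number as the
    // index here is just an illustrative choice.
    IndexedRow(number.toLong, Vectors.dense(Array.fill(cols)(random.nextDouble())))
  }.cache()
}

Even if a restructuring like this works, I would still like to understand what exactly pulls the SparkContext into the closure in the original version.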