According to the comments in the Spark source code, SparkContext.scala has:
/** Distribute a local Scala collection to form an RDD.
*
* @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
* to parallelize and before the first action on the RDD, the resultant RDD will reflect the
* modified collection. Pass a copy of the argument to avoid this.
* @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
* RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
*/
So I thought I would run a simple test.
scala> var c = List("a0", "b0", "c0", "d0", "e0", "f0", "g0")
c: List[String] = List(a0, b0, c0, d0, e0, f0, g0)
scala> var crdd = sc.parallelize(c)
crdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> c = List("x1", "y1")
c: List[String] = List(x1, y1)
scala> crdd.foreach(println)
[Stage 0:> (0 + 0) / 8]d0
a0
b0
e0
f0
g0
c0
scala>
Given the lazy behavior of parallelize, I expected crdd.foreach(println) to print "x1" and "y1".
What am I doing wrong?
Answer 0 (score: 2)
You did not modify c at all. You reassigned the variable to a new list, and the RDD still refers to the original one.
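Beyond that, look at what the note actually says:
"If `seq` is a mutable collection"
Scala's List is not a mutable collection.
"and is altered after the call to parallelize and before the first action on the RDD"
Well, you never actually altered the list.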
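The difference between rebinding a variable and mutating a collection can be sketched without Spark. In the snippet below, `LazyHolder` is a hypothetical stand-in for the RDD, not a Spark class: it only keeps a reference to the collection it was given and reads it when `collect()` is called, which is roughly the behavior the note describes.

```scala
import scala.collection.mutable.ListBuffer

object RebindVsMutate {
  // Hypothetical stand-in for a lazily evaluated RDD (not part of Spark):
  // it stores the reference and reads the data only when collect() runs.
  final class LazyHolder[T](data: scala.collection.Seq[T]) {
    def collect(): List[T] = data.toList
  }

  def main(args: Array[String]): Unit = {
    // Case 1: rebinding. The holder keeps pointing at the first list.
    var c: scala.collection.Seq[String] = List("a0", "b0")
    val held = new LazyHolder(c)
    c = List("x1", "y1")        // the variable now names a different object
    println(held.collect())     // prints List(a0, b0)

    // Case 2: mutation. The holder sees the change made in place.
    val buf = ListBuffer("a0", "b0")
    val heldBuf = new LazyHolder(buf)
    buf += "c0"                 // the same object is altered
    println(heldBuf.collect())  // prints List(a0, b0, c0)
  }
}
```

Only the second case matches the situation the note warns about: the object handed over is itself changed before it is read.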
Here is a proper example of the documented behavior.
scala> val c = scala.collection.mutable.ListBuffer(1, 2, 3)
c: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 2, 3)
scala> val cRDD = sc.parallelize(c)
cRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:29
scala> c.append(4)
scala> c
res7: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 2, 3, 4)
scala> cRDD.collect()
res8: Array[Int] = Array(1, 2, 3, 4)
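The note also says to pass a copy of the argument to avoid this. A minimal sketch of that advice follows, written as a standalone program rather than a spark-shell session. The object name, app name and `local[2]` master are assumptions chosen for illustration; `toList` is one way to take an immutable snapshot.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object ParallelizeCopy {
  def main(args: Array[String]): Unit = {
    // Assumed local setup for illustration; in spark-shell `sc` already exists.
    val conf = new SparkConf().setAppName("parallelize-copy").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val c = ListBuffer(1, 2, 3)
      // c.toList creates an immutable copy, so later changes to c
      // cannot reach the data the RDD will read.
      val cRDD = sc.parallelize(c.toList)
      c.append(4)
      println(cRDD.collect().mkString(", ")) // prints 1, 2, 3
    } finally {
      sc.stop()
    }
  }
}
```

Because the RDD holds the copy and not the buffer, the appended element does not show up when the first action runs.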