Perform operations on only a subset of an RDD

Asked: 2014-05-11 15:48:19

Tags: apache-spark

I would like to perform some transformations on only a subset of an RDD (so I can experiment faster in the REPL).

Is that possible?

RDD has a take(num: Int): Array[T] method. I think I need something similar, but returning an RDD[T].

4 answers:

Answer 0 (score: 19)

You can use RDD.sample to get back an RDD rather than an Array. For example, to sample ~1% without replacement:

val data = ...
data.count
...
res1: Long = 18066983

val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190

The third parameter is the seed; fortunately, it becomes optional in the next Spark release.
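As a minimal sketch, assuming a Spark version where the seed parameter of sample already has a default value (so only withReplacement and fraction are needed):

// Sketch only: sample ~1% without replacement, omitting the seed.
// The data here is a small hypothetical RDD built in spark-shell.
val data = sc.parallelize(1 to 1000000)
val sample = data.sample(false, 0.01)  // withReplacement = false, fraction = 1%
sample.count                           // roughly 10000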

Answer 1 (score: 2)

RDDs are distributed collections which are materialized only on actions. It is not possible to truncate your RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).

If you want to get similarly sized RDDs regardless of the input size, you can truncate the items in each of your partitions - this way you can better control the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism.

An example from spark-shell:

import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000

val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)

val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))

val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

millionRdd.count          // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions

Answer 2 (score: 0)

Apparently it is possible to first create a subset of the RDD using its take method, and then pass the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.

This approach seems hacky to me. Is there a better way?
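For illustration, a rough sketch of that take-then-makeRDD approach in spark-shell (the subset size and slice count below are arbitrary, assumed values):

// Pull a fixed-size subset to the driver, then redistribute it as a new RDD.
val data = sc.parallelize(1 to 1000000)
val subsetArray = data.take(1000)           // Array[Int] collected on the driver
val subsetRdd = sc.makeRDD(subsetArray, 4)  // back to an RDD with 4 slices
subsetRdd.count                             // 1000

Note that take pulls the subset onto the driver, so this only works for subsets that comfortably fit in driver memory.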

Answer 3 (score: 0)

I always use SparkContext's parallelize function to distribute from an Array[T], but it seems makeRDD does the same thing. Both ways are correct.
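As a quick sketch (the array and slice count are arbitrary), both calls take a local Seq[T] plus a number of slices and behave the same here:

// Both distribute a local collection into an RDD; makeRDD is effectively
// an alias of parallelize for a Seq[T].
val arr = Array(1, 2, 3, 4, 5)
val rdd1 = sc.parallelize(arr, 2)
val rdd2 = sc.makeRDD(arr, 2)
rdd1.collect.sameElements(rdd2.collect)  // true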