I have an RDD of tuples, and collecting it gives me a single list:
[x1, x2, x3, x4, x5]
But I want that list in multiple chunks,
like [ [x1,x2,x3], [x4,x5] ]
To do this, I could first run a collect on the RDD and then split the resulting list into chunks. But I want to do it without collecting, because collect can raise heap-space errors and brings all the data to the driver, which is inefficient.
Answer 0 (score: 0)
Question: is there any efficient way to chunk a big list into several lists without performing a collect?
Instead of collecting the big RDD and slicing the resulting list into multiple lists, you can split the big RDD into multiple small RDDs for further processing...
Collecting a big RDD is not a good idea. However, if you want to break a big RDD into small parts, e.g. an Array[RDD], you can use the approach below, written in Scala; you can convert it to Python by looking at the example here.
Python documentation here.
You can do a random split; see the documentation here.
You can see how it is implemented in the code available on git:
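For intuition, randomSplit's per-element semantics can be sketched in plain Python (no Spark involved; `random_split` below is a hypothetical helper for illustration only, not the Spark API): the weights are normalized into cumulative boundaries, and each element lands in the bucket whose range contains a uniform random draw.

```python
import random

def random_split(items, weights, seed=42):
    # Hypothetical helper, NOT the Spark API: illustrates randomSplit's
    # semantics. Weights are normalized to cumulative boundaries, and each
    # element falls into the bucket whose range contains a uniform draw.
    total = sum(weights)
    cum = []
    acc = 0.0
    for w in weights:
        acc += w / total
        cum.append(acc)
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for x in items:
        r = rng.random()
        for i, c in enumerate(cum):
            if r < c:
                buckets[i].append(x)
                break
        else:
            buckets[-1].append(x)  # guard against floating-point rounding
    return buckets

parts = random_split(range(100), [0.5, 0.5], seed=7)
print([len(p) for p in parts])  # two buckets, 100 elements in total
```

With a fixed seed the assignment is deterministic, mirroring the `seed` parameter of Spark's randomSplit; the bucket sizes are only approximately proportional to the weights, which is also true of the real API.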
/**
 * Randomly splits this RDD with the provided weights.
 *
 * @param weights weights for splits, will be normalized if they don't sum to 1
 * @param seed random seed
 *
 * @return split RDDs in an array
 */
def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  require(weights.forall(_ >= 0),
    s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
  require(weights.sum > 0,
    s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  withScope {
    val sum = weights.sum
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    normalizedCumWeights.sliding(2).map { x =>
      randomSampleWithRange(x(0), x(1), seed)
    }.toArray
  }
}
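The normalization step in that source (`scanLeft` to accumulate, then `sliding(2)` to pair neighbors) turns the weights into adjacent sampling ranges. A minimal Python rendering of just that arithmetic (illustration only, not Spark code):

```python
# Mirror of the Scala normalization: weights -> normalized cumulative
# boundaries -> adjacent [low, high) sampling ranges, one per output RDD.
weights = [2.0, 1.0, 1.0]
total = sum(weights)
cum = [0.0]
for w in weights:
    cum.append(cum[-1] + w / total)  # scanLeft(0.0)(_ + _)
# sliding(2): each adjacent pair is the range for one randomSampleWithRange call
ranges = list(zip(cum[:-1], cum[1:]))
print(ranges)  # [(0.0, 0.5), (0.5, 0.75), (0.75, 1.0)]
```

Each range's width is the normalized weight, so a weight of 2.0 out of a total of 4.0 claims half of the unit interval.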
Scala example (not familiar with Python :-)); for Python, see the documentation here
import org.apache.log4j.Level
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

/**
 * Created by Ram Ghadiyaram
 */
object RDDRandomSplitExample {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]) {
    val spark = SparkSession.builder
      .master("local")
      .appName("RDDRandomSplitExample")
      .getOrCreate()
    val y = spark.sparkContext.parallelize(1 to 100)
    // break/split the big rdd into small rdds
    val splits: Array[RDD[Int]] = y.randomSplit(
      Array(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))
    splits.foreach(x => println("number of records in each rdd " + x.count))
  }
}
Result:
number of records in each rdd 9
number of records in each rdd 9
number of records in each rdd 8
number of records in each rdd 7
number of records in each rdd 9
number of records in each rdd 17
number of records in each rdd 11
number of records in each rdd 9
number of records in each rdd 7
number of records in each rdd 6
number of records in each rdd 8
Conclusion: you can see a roughly equal number of elements in each RDD, and you can process each small RDD without ever collecting the original big RDD.