Question

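I have a text file containing about 4264k records. I am splitting the records into 5 partitions and want to select the first partition for processing. How can I do that?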

rdd = sc.textFile("file:///user/somelocation/a.txt", 5)

How do I select the first partition for further processing?

Answer 1

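If I understand your question correctly, you want to inspect or process the data of one specific partition.

In the example below, I extract the data of the partition with index 2. Partition indices are zero-based, so this is the third partition; the first partition has index 0.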

val rd = sc.parallelize(1 to 100, 4) // RDD of the numbers 1 to 100, split into 4 partitions

rd.mapPartitionsWithIndex((index, iter) => Iterator((index, iter.toList)), true) // pair each partition's contents with its zero-based index
    .filter(x => x._1 == 2) // keep only the partition with index 2 (use 0 for the first partition)
    .collect.foreach(println)

Output:

(2,List(51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75))  // partition number followed by List of values in the partition
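For the question as asked (the first partition only), a variant that returns the records themselves, without building an (index, list) pair per partition, might look like the sketch below. It is an illustration rather than part of the original answer: the object name, app name, and local master setting are assumptions made so the snippet can run on its own, and in the asker's case `sc.textFile("file:///user/somelocation/a.txt", 5)` would take the place of `sc.parallelize`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone wrapper; in spark-shell only the body using `sc` is needed.
object FirstPartitionSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions for a self-contained run.
    val conf = new SparkConf().setAppName("first-partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rd = sc.parallelize(1 to 100, 4)

      // Pass through the iterator of partition 0 and return an empty
      // iterator for every other partition. The result is still an RDD,
      // so further transformations can be chained on it.
      val firstPartition = rd.mapPartitionsWithIndex[Int](
        (index, iter) => if (index == 0) iter else Iterator.empty,
        preservesPartitioning = true
      )

      firstPartition.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the partition's iterator is passed through instead of being converted with `toList`, the partition is not materialized as a single in-memory list on the executor, which may matter for a file with millions of records. The `collect()` at the end is only for display; for large data it would normally be replaced by further transformations or a write.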
