Question

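I have a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB. I'd like to iterate over the values in the RDD on my local machine. I can't use collect(), because it would create an array locally that is too big for my heap. I need some iterative way. There is the method iterator(), but it requires some additional information that I can't provide.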

UPD: the toLocalIterator method has since been committed.

Answer 1

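Update: the RDD.toLocalIterator method, which appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.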
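For reference, a minimal pyspark-flavoured sketch of how toLocalIterator can be used to stream an RDD to the local machine. The helper name write_rdd_locally and the one-value-per-line file layout are assumptions made for illustration; the only Spark call used is RDD.toLocalIterator(), and the sketch assumes each partition fits in local memory on its own.

```python
def write_rdd_locally(rdd, path):
    """Stream every element of `rdd` into a local text file, one per line.

    `rdd` is expected to expose toLocalIterator(), as pyspark RDDs do.
    Elements arrive partition by partition, so the whole RDD is never
    held in local memory at once.
    """
    count = 0
    with open(path, "w") as out:
        for value in rdd.toLocalIterator():
            # (value,) keeps tuple elements from being unpacked by %
            out.write("%s\n" % (value,))
            count += 1
    return count
```

With a real RDD this would be called as write_rdd_locally(rdd, "values.txt"), where the file name is just a placeholder.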

TL;DR The original answer may still give a rough idea of how it works:

First, get the array of partitions:

val parts = rdd.partitions

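Then create smaller RDDs by filtering out everything except a single partition. Collect the data from each smaller RDD and iterate over the values of that single partition: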

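for (p <- parts) {
    val idx = p.index
    // Keep only the elements of partition idx; every other partition becomes empty.
    // mapPartitionsWithIndex expects a two-argument function (index, iterator).
    val partRdd = rdd.mapPartitionsWithIndex(
        (index, iter) => if (index == idx) iter else Iterator.empty,
        preservesPartitioning = true)
    // The second argument is true so that the existing partitioning is kept (no reshuffling)
    val data = partRdd.collect // data contains all values from a single partition
                               // in the form of an array
    // Now you can do with the data whatever you want: iterate, save to a file, etc.
}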

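I didn't try this code, but it should work. Please write a comment if it doesn't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).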
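As a rough guide for picking numParts, here is a small back-of-the-envelope sketch in Python 3. The helper partitions_needed is hypothetical, and the 64 MB per-partition budget is an assumed safety margin, not a Spark setting. It also assumes partitions come out roughly equal in size and ignores the fact that deserialized objects usually take more room than the raw data.

```python
import math


def partitions_needed(total_bytes, budget_bytes):
    """Smallest partition count whose average partition fits the budget."""
    if total_bytes < 0 or budget_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and budget_bytes > 0")
    return max(1, math.ceil(total_bytes / budget_bytes))


# Numbers from the question: a 1 GB RDD and a 512 MB local heap.
# Budget 64 MB per collected partition (an assumption) to leave headroom.
num_parts = partitions_needed(1024 * 1024 * 1024, 64 * 1024 * 1024)
print(num_parts)  # prints 16
```

The result would then be passed as numParts to rdd.coalesce(numParts, true).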

Answer 2

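Wildfire's answer seems semantically correct, but I'm sure you should be able to be much more efficient by using Spark's API. If you want to process each partition in turn, I don't see why you can't use the map / filter / reduce / reduceByKey / mapPartitions operations. The only time you'd want to have everything in one place in one array is when you are going to perform a non-monoidal operation, but that doesn't seem to be what you want. You should be able to do something like: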

rdd.mapPartitions(recordsIterator => your code that processes a single chunk)

Or this:

rdd.foreachPartition(partition => {
  // This function runs on the executors, once per partition
  val records = partition.toArray
  // Your code
})
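To make the mapPartitions suggestion a bit more concrete, here is a minimal sketch written in pyspark style rather than Scala. The names summarize_partition and merge_summaries are hypothetical, and the sketch assumes numeric elements. Each partition's iterator is consumed lazily and reduced to one small tuple, so only the merged summary ever has to travel to the local machine.

```python
def summarize_partition(records):
    """Consume one partition's iterator lazily; yield a single (count, total)."""
    count = 0
    total = 0
    for value in records:
        count += 1
        total += value
    yield (count, total)


def merge_summaries(a, b):
    """Combine two (count, total) pairs; associative, so usable with reduce."""
    return (a[0] + b[0], a[1] + b[1])
```

With a real RDD of numbers this would be used as rdd.mapPartitions(summarize_partition).reduce(merge_summaries).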

Answer 3

This is the same approach as suggested by @Wildlife, but written in pyspark.

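The nice thing about this approach is that it lets the user access the records in the RDD in order. I'm using this code to feed data from an RDD into the STDIN of a machine learning tool's process.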

rdd = sc.parallelize(range(100), 10)
def make_part_filter(index):
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print("partition id: %s elements: %s" % (part_id, data_from_part_rdd))

Produces output:

partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
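The STDIN part is not shown above, so here is a minimal sketch of one way it could look in Python 3. The helper feed_to_process and the one-record-per-line format are assumptions, not the code actually used in this answer. The sketch also assumes the child process reads all of its input; if it exits early, the write will raise an error.

```python
import subprocess


def feed_to_process(records, command):
    """Write each record as one text line to the STDIN of `command`.

    Returns the exit code of the child process. The child's stdout and
    stderr are inherited from the current process.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with proc.stdin as pipe:
        for record in records:
            pipe.write(("%s\n" % (record,)).encode("utf-8"))
    # Leaving the with-block closes the pipe, which signals end of input
    return proc.wait()
```

Inside the loop above it could be called as feed_to_process(data_from_part_rdd, ["my-ml-tool"]), where my-ml-tool stands for whatever command the tool really uses.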

Answer 4

A pyspark DataFrame solution using RDD.toLocalIterator():

import io  # Python 3: io.StringIO replaces cStringIO

separator  = '|'
# hiveCtx (a HiveContext) and sql (a query string) are defined elsewhere
df_results = hiveCtx.sql(sql)
columns    = df_results.columns
print(separator.join(columns))

# Use toLocalIterator() rather than collect(), as this avoids pulling all of the
# data to the driver at one time.  Rather, "the iterator will consume as much memory
# as the largest partition in this RDD."
MAX_BUFFERED_ROW_COUNT = 10000
row_count              = 0
output                 = io.StringIO()
for record in df_results.rdd.toLocalIterator():
    d = record.asDict()
    output.write(separator.join([str(d[c]) for c in columns]) + '\n')
    row_count += 1
    if row_count % MAX_BUFFERED_ROW_COUNT == 0:
        print(output.getvalue().rstrip())
        # it is faster to create a new StringIO rather than clear the existing one
        # http://stackoverflow.com/questions/4330812/how-do-i-clear-a-stringio-object
        output = io.StringIO()
# Flush whatever is left in the buffer after the last full batch
if row_count % MAX_BUFFERED_ROW_COUNT:
    print(output.getvalue().rstrip())

Answer 5

Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work.

The API says that there are map, filter and save-as-file commands (such as saveAsTextFile): https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations

Answer 6

For Spark 1.3.1, the format is as follows:

val parts = rdd.partitions
for (p <- parts) {
    val idx = p.index
    // The element type (String, String, Float) is specific to this RDD; adjust it to yours
    val partRdd = rdd.mapPartitionsWithIndex {
        case (index: Int, value: Iterator[(String, String, Float)]) =>
            if (index == idx) value else Iterator()
    }
    val dataPartitioned = partRdd.collect
    // Apply further processing on data
}

Spark: Best practice for retrieving big data from an RDD to the local machine
