Question

Suppose I have the following RDD:

test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)

How can I create the following RDD:

((1,'trial1',[1,2]),(2,'trial2',[3,4]))

I tried using an accumulator, but it doesn't work because an accumulator's value cannot be read inside a task:

def increm(keyvalue):
    global acc
    acc +=1
    return (acc.value,keyvalue[0],keyvalue[1])


acc = sc.accumulator(0)
test1RDD.map(lambda x: increm(x)).collect()

Any idea how to do this?

Answer 1

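You can use zipWithIndex:

zipWithIndex()

Zips this RDD with its element indices.

The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.

This method needs to trigger a Spark job when this RDD contains more than one partition.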

>>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
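
The ordering rule quoted above can be mimicked in plain Python, which may help show why a job is needed when there is more than one partition: each partition's starting index depends on the sizes of all partitions before it. This is only an illustrative sketch, not Spark's implementation; `zip_with_index_local` and the hand-picked partition layout are hypothetical.

```python
from itertools import accumulate


def zip_with_index_local(partitions):
    """Mimic the index assignment described for RDD.zipWithIndex on a list of lists.

    `partitions` is a hypothetical stand-in for an RDD's partitions.
    """
    sizes = [len(p) for p in partitions]
    # Starting index of each partition = number of items in all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        (item, offset + i)
        for part, offset in zip(partitions, offsets)
        for i, item in enumerate(part)
    ]


# Assumed layout of four items over three partitions (Spark may split them differently).
pairs = zip_with_index_local([["a"], ["b"], ["c", "d"]])
print(pairs)
```

With a single partition the offset is trivially 0, so no partition sizes need to be counted first.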

Then use map to transform the RDD so that the index comes first in the new RDD.

This is untested, as I don't have an environment available:

test1 = (('trial1',[1,2]),('trial2',[3,4]))
test1RDD = sc.parallelize(test1)
test1RDD.zipWithIndex().map(lambda x : (x[1],x[0]))
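
Note that the snippet above produces 0-based, nested pairs such as (0, ('trial1', [1, 2])), whereas the question asks for flat, 1-based tuples. A minimal sketch of a mapping function that closes that gap follows; `to_indexed_row` is a hypothetical helper name, and the local list stands in for what `zipWithIndex()` would return so the logic can be checked without a Spark context.

```python
def to_indexed_row(pair):
    """Turn ((key, values), index) into (index + 1, key, values)."""
    (key, values), index = pair
    return (index + 1, key, values)


test1 = (('trial1', [1, 2]), ('trial2', [3, 4]))
# Local stand-in for test1RDD.zipWithIndex().collect(): (element, index) pairs.
zipped = [(element, index) for index, element in enumerate(test1)]
rows = [to_indexed_row(pair) for pair in zipped]
print(rows)
# With a SparkContext `sc` as in the question, the same function should work as:
# test1RDD.zipWithIndex().map(to_indexed_row).collect()
```

Using a named function rather than a lambda keeps the tuple unpacking readable; the `+ 1` is only there because the question wants numbering to start at 1.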
