Question

I'm getting this error and I don't know why. It comes from this code:

    a = data.mapPartitions(helper(locations))

where data is an RDD and my helper is defined as:

    def helper(iterator, locations): 
        for x in iterator:
            c = locations[x]
            yield c

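(locations is just an array of data points.) I can't see what the problem is, but I'm also not very good at pyspark, so can someone tell me why this code gives me "'PipelinedRDD' object is not iterable"?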

Answer 1

You can process the elements of an RDD with map and lambda functions. I hit the same error on a PipelinedRDD with the code below:

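    # Assumes an existing SparkContext named sc
    lines1 = sc.textFile(r"\..\file1.csv")
    lines2 = sc.textFile(r"\..\file2.csv")

    # Tag every line with the file it came from
    pairs1 = lines1.map(lambda s: (int(s), 'file1'))
    pairs2 = lines2.map(lambda s: (int(s), 'file2'))

    pair_result = pairs1.union(pairs2)

    # reduceByKey returns a new RDD, so keep the result
    pair = pair_result.reduceByKey(lambda a, b: a + ',' + b)

    result = pair.map(lambda l: tuple(l[:1]) + tuple(l[1].split(',')))
    # This line raises the TypeError: an RDD cannot be iterated directly
    result_ll = [list(elem) for elem in result]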

===> result_ll = [list(elem) for elem in result]

TypeError: 'PipelinedRDD' object is not iterable

Instead, I replaced the iteration with the map function:

    result_ll = result.map(lambda elem: list(elem))

Hope this helps you modify your code accordingly.
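Note that map returns another RDD, not a Python list. As a rough sketch of the same idea, the per-element conversion can be written as a plain function and the result pulled back with collect(). The names `to_row` and `rows_as_list` are made up for this example, and it assumes each element is a `(key, 'file1,file2')` pair as produced above:

```python
def to_row(pair):
    """Turn (key, 'file1,file2') into [key, 'file1', 'file2']."""
    key, joined = pair
    return [key] + joined.split(',')


def rows_as_list(pair_rdd):
    # Assumption: pair_rdd is an RDD of (key, comma-joined string) pairs.
    # map() is a lazy transformation and returns a new RDD;
    # collect() is the action that returns the rows to the driver as a list.
    return pair_rdd.map(to_row).collect()
```

Calling `rows_as_list(pair)` would give an ordinary list that can be looped over. collect() loads every row into the driver's memory, so this is only reasonable for small results.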

Answer 2

I prefer the answer given to another question, linked here: Can not access Pipelined Rdd in pyspark

You cannot iterate over an RDD directly; you first need to call an action to bring the data back to the driver. A quick sample:

    >>> test = sc.parallelize([1,2,3])
    >>> for i in test:
    ...     print(i)
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'RDD' object is not iterable

But you can, for example, use '.collect()':

    >>> for i in test.collect():
    ...     print(i)
    ...
    1
    2
    3
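Applied to the helper from the question, this might look like the sketch below. It is only a guess at the intended setup: it assumes `locations` is an ordinary Python list (not an RDD) and `data` is an RDD of integer indices. `lookup_all` is a name invented for the example. mapPartitions expects a function that takes a single iterator, so `locations` is bound with functools.partial instead of calling `helper(locations)` directly:

```python
from functools import partial


def helper(iterator, locations):
    # Look up each element of the partition in `locations`.
    for x in iterator:
        yield locations[x]


def lookup_all(data, locations):
    # Assumption: `data` is an RDD of indices, `locations` a plain Python list.
    # Pass mapPartitions a one-argument function; do not call helper here.
    mapped = data.mapPartitions(partial(helper, locations=locations))
    # `mapped` is still an RDD; collect() returns its contents as a list.
    return mapped.collect()
```

A lambda such as `lambda it: helper(it, locations)` would do the same job as partial. If `locations` were itself an RDD, this would not work, since an RDD cannot be used inside a function that runs on the executors.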

pyspark: 'PipelinedRDD' object is not iterable
