Question

>>> rdd.collect()
[1, 2, 3, 4]

I defined two map functions that I think should be equivalent.

def map2(l):
    result = []
    for i in l:
        result.append(((i, i), 1))
        result.append((i, 1))
    for j in result:
        yield j

Output

>>> rdd.mapPartitions(map2).collect()
[((1, 1), 1), (1, 1), ((2, 2), 1), (2, 1), ((3, 3), 1), (3, 1), ((4, 4), 1), (4, 1)]

The other function

def map2(l):
    result = []
    for i in l:
        result.append(((i, i), 1))
    for i in l:
        result.append((i, 1))
    for j in result:
        yield j

Output

>>> rdd.mapPartitions(map2).collect()
[((1, 1), 1), ((2, 2), 1), ((3, 3), 1), ((4, 4), 1)]

Answer 1

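Iterators are stateful and cannot be traversed more than once. The argument passed to a mapPartitions function is an iterator over the partition, not a list. The first for loop in your second example consumes every available item and leaves an empty iterator, so the second loop has nothing left to append.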
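The same behaviour can be reproduced in plain Python, without Spark. This is a minimal sketch that assumes the partition arrives as an ordinary iterator; the names `partition`, `first_pass` and `second_pass` are illustrative only.

```python
# Stand-in for the iterator that a mapPartitions function receives.
partition = iter([1, 2, 3, 4])

first_pass = [x for x in partition]   # consumes every item
second_pass = [x for x in partition]  # the iterator is already exhausted

print(first_pass)   # [1, 2, 3, 4]
print(second_pass)  # []
```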

If you want to traverse it more than once, convert the iterator to a list first:

def map2(l):
    l = list(l)
    result=[]
    ...
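Filled in with the loop bodies from the question, the fix might look like the sketch below. The name `map2_fixed` is hypothetical, and the function is exercised on a plain iterator standing in for a single partition rather than on a real RDD.

```python
def map2_fixed(l):
    # Materialize the partition so it can be traversed more than once.
    l = list(l)
    result = []
    for i in l:
        result.append(((i, i), 1))
    for i in l:
        result.append((i, 1))
    for j in result:
        yield j

# Simulate one partition with a plain iterator (assumption: no Spark here).
out = list(map2_fixed(iter([1, 2, 3, 4])))
print(out)
```

Both kinds of pair are now produced, but within a partition they come out grouped (all the `((i, i), 1)` pairs, then all the `(i, 1)` pairs) rather than interleaved as in the first function. Note also that `list(l)` holds the whole partition in memory.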
