Question

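I am using a for loop in my script to call a function for each element of size_DF (a data frame), but it takes a lot of time. I tried to remove the for loop by using map, but I am not getting any output. size_DF is a list of around 300 elements that I fetch from a table.

Using for: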

import call_functions

newObject = call_functions.call_functions_class()
size_RDD = sc.parallelize(size_DF) 

if len(size_DF) == 0:
    print "No record present in the truncated list"
else:

    for row in size_DF:
        length = row[0]
        print "length: ", length
        insertDF = newObject.full_item(sc, dataBase, length, end_date)

Using map:

if len(size_DF) == 0:
    print "No record present in the list"
else:
    size_RDD.mapPartition(lambda l: newObject.full_item(sc, dataBase, len(l[0]), end_date))

newObject.full_item(sc, dataBase, len(l[0]), end_date): in full_item(), I am doing some select operations, joining 2 tables, and inserting the data into a table.

Please help me and let me know what I am doing wrong.

Answer 1

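The pyspark.rdd.RDD.mapPartitions method is lazily evaluated (note that the RDD method is spelled mapPartitions, with a trailing "s"). To force evaluation, you would normally call an action on the returned lazy RDD, that is, a method that returns a value.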
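Lazy evaluation is easy to miss, so here is a minimal plain-Python analogy (not Spark itself): Python 3's built-in `map` is also lazy, and nothing runs until something consumes the result. The names `record_insert` and `inserted` are hypothetical stand-ins for the side effect inside `full_item`.

```python
inserted = []  # hypothetical stand-in for the table that full_item writes to

def record_insert(length):
    # Side effect we care about, analogous to the insert inside full_item.
    inserted.append(length)
    return length

lengths = [3, 5, 8]

# Building the lazy pipeline runs nothing, like calling mapPartitions alone.
lazy = map(record_insert, lengths)
after_map = list(inserted)

# Consuming the result forces evaluation, like calling an action on an RDD.
for _ in lazy:
    pass
after_consume = list(inserted)

print(after_map)      # []
print(after_consume)  # [3, 5, 8]
```

In the question's code, the result of the lazy call is never consumed, which would explain why no output appears.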

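There are higher-level functions that take care of forcing evaluation of the RDD values, for example pyspark.rdd.RDD.foreach.

Since you don't really care about the results of the operation, you can use pyspark.rdd.RDD.foreach instead of pyspark.rdd.RDD.mapPartitions.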

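def first_of(it):
    # Return the first element of an iterable, or [] if it is empty.
    for first in it:
        return first
    return []

def insert_first(it):
    # With foreach, `it` is a single element (row) of the RDD.
    first = first_of(it)
    item_count = len(first)
    newObject.full_item(sc, dataBase, item_count, end_date)


if len(size_DF) == 0:
    print('No record present in the truncated list')
else:
    # foreach is an RDD action (all lowercase), so call it on size_RDD,
    # not on the plain Python list size_DF.
    size_RDD.foreach(insert_first)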
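One caveat, stated as an assumption about the question's setup: `full_item` takes `sc`, and a SparkContext can only be used on the driver, not inside functions that run on executors, so calling it from `foreach` may fail. A different technique that is sometimes used in that situation is to keep the calls on the driver and run them concurrently from a thread pool. The sketch below shows only the threading pattern; `fake_full_item` is a hypothetical stand-in for `newObject.full_item(sc, dataBase, length, end_date)`, the sample rows are made up, and the pool size of 8 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_full_item(length):
    # Hypothetical stand-in: the real call would run the select/join/insert
    # through Spark from the driver.
    return "inserted rows for length %d" % length

# Hypothetical sample of size_DF: each row holds the length in position 0.
size_DF = [(10,), (20,), (30,)]

def process_row(row):
    return fake_full_item(row[0])

if len(size_DF) == 0:
    print("No record present in the truncated list")
    results = []
else:
    # Threads run on the driver; each call can submit its own Spark job.
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map returns results in the same order as the input rows.
        results = list(pool.map(process_row, size_DF))

print(results)
```

Whether this actually shortens the run depends on the cluster having spare capacity to execute the submitted jobs side by side.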
