Question

After groupByKey(), my RDD has only one row, e.g. (0, [a list of names]). Use case: write the list of names to a file on S3. Because this RDD has only one row, I call foreach() on it directly.

The code is:

def write_to_s3(keyValue):
    lines = keyValue[1]
    tmp_file = ...
    with open(tmp_file, 'w+') as f:
        for line in lines:
            f.write(line + '\n')
    # upload tmp_file to s3
    # remove tmp_file

myRDD.foreach(write_to_s3)

My question is about the function write_to_s3(): in lines = keyValue[1], will memory blow up if lines (the list) is too large?

Answer 1

"My question is about the function write_to_s3(): in lines = keyValue[1], will memory blow up if lines (the list) is too large?"

No, but groupByKey on its own could. In other words, if the data for a single key is large, the code may fail before write_to_s3() is ever called.

The best option you have is to use DataFrameWriter with partitionBy, then merge and output the results.

df_before_group_by_key.toDF(["key", "value"]).write.partitionBy("key").text("some_file")
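
A slightly fuller sketch of the line above, assuming the RDD before groupByKey holds (key, name) pairs with string values and that the cluster can already write to the destination path. The function name, the example path in the docstring, and the extra repartition("key") step (added so that each key's rows land in a single part file) are assumptions, not part of the original answer.

```python
def write_names_partitioned(pairs_rdd, output_path):
    """Write one directory per key under output_path, one name per line.

    pairs_rdd: RDD of (key, name) tuples, taken *before* any groupByKey.
    output_path: hypothetical destination, e.g. "s3a://my-bucket/names".
    Needs an active SparkSession, because RDD.toDF depends on it.
    """
    (
        pairs_rdd.toDF(["key", "value"])
        .repartition("key")        # assumption: co-locate each key's rows
        .write.partitionBy("key")  # yields output_path/key=<k>/ directories
        .text(output_path)         # "value" is the only remaining string column
    )
```

No value list is ever built on a single executor here; rows are streamed to the output files as they are processed.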

If that is not possible, try:

df_before_group_by_key.repartitionAndSortWithinPartitions(...).mapPartitions(write_to_s3)

where

def write_to_s3(iterator):
     ... # 1. check the first key
     ... # 2. open new file
     ... # 3. write until you encounter new one
     ... # 4. upload
     ... # 5. go back to 2, repeat until iterator is empty
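
The five steps above could be sketched roughly as follows, assuming each partition yields (key, line) pairs already sorted by key. The upload callback and the name make_partition_writer are hypothetical placeholders; the actual S3 upload (for example with boto3) is left to the caller.

```python
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def make_partition_writer(upload):
    """Build a per-partition function that writes one file per key.

    upload(key, path) is a hypothetical callback that sends the finished
    file to S3; it is not defined in the original answer.
    """
    def write_partition(iterator):
        # Assumes records arrive as (key, line) pairs sorted by key, as
        # repartitionAndSortWithinPartitions would deliver them.
        for key, records in groupby(iterator, key=itemgetter(0)):
            fd, path = tempfile.mkstemp(suffix=".txt")  # step 2: new file
            try:
                with os.fdopen(fd, "w") as f:
                    # step 3: stream lines until the key changes
                    for _, line in records:
                        f.write(line + "\n")
                upload(key, path)                       # step 4: upload
            finally:
                os.remove(path)                         # clean up temp file
    return write_partition
```

Because write_partition returns nothing, it fits foreachPartition, which is an action and runs immediately. mapPartitions is lazy and expects the function to return an iterable, so using it would require yielding something from the function and triggering an action afterwards. The values for a key are consumed one at a time from the iterator, so no full list is held in memory.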

Spark foreach(keyValue): does the value cause memory to blow up?
