Question

I am running a Spark job that works in the following steps:

First, it reads a directory of files:

data = sc.binaryFiles()
Then it processes each file separately:

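res = data.map(lambda kv: func_1(kv[0], kv[1]))
func_1 calls another function, func_2, which processes the content of each file and returns a list of lists to func_1. Now I need to turn this list of lists into a Spark RDD and write it to HDFS, but I don't know how to do that.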
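
One possible approach, sketched under assumptions: if func_1 simply returns the list of lists, flatMap can flatten the per-file results into a single RDD of rows, which can then be written as text. The names decode_file, row_to_line, write_decoded_rows, input_dir and output_dir are hypothetical, and the comma-splitting body of decode_file is only a placeholder for the real decoding.

```python
def decode_file(filename, file_content):
    """Hypothetical stand-in for func_1: returns a list of rows (a list of lists)."""
    # Placeholder decoding only: treat the bytes as comma-separated text lines.
    text = file_content.decode("utf-8")
    return [line.split(",") for line in text.splitlines() if line]


def row_to_line(row, sep=","):
    """Turn one decoded row (inner list) into a single output line."""
    return sep.join(str(field) for field in row)


def write_decoded_rows(sc, input_dir, output_dir):
    """Decode every file under input_dir and write all rows under output_dir.

    sc is an existing SparkContext; both paths are assumptions (for example
    HDFS URIs). This function is not called here, so no cluster is needed to
    import the helpers above.
    """
    data = sc.binaryFiles(input_dir)  # RDD of (path, bytes) pairs
    # flatMap flattens each file's list of rows into one RDD of rows.
    rows = data.flatMap(lambda kv: decode_file(kv[0], kv[1]))
    rows.map(row_to_line).saveAsTextFile(output_dir)
```

Note that saveAsTextFile writes a directory of part files rather than one file, and that this variant merges the rows of all input files into one output.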

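I am very new to Spark. Any help with this would be greatly appreciated. Thanks in advance.

Edited: as suggested, here are the definitions of Func_1 and Func_2: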

def Func_1(filename, file_content):
    Outputfile = "some code for generating output file name for each input file"

    decode_data = Func_2(StringIO(file_content))

    ## save decode_data here in HDFS.

def Func_2(file_obj):
    ## Decodes the file sequentially (this is necessary because each binary file has headers attached to each portion of the file)
    ## and returns a list of lists, where each inner list is one row of the decoded data and the outer list is the collection of such rows.
    ## (Skipping the code as it is trivial.)
    pass
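
If one output location per input file is needed, a possible sketch is to bring each file's decoded rows back to the driver and call sc.parallelize on them there, since a SparkContext cannot be used inside a function that runs on the workers (such as one passed to map). The names output_name_for, save_each_file, decoded_by_file and output_root are hypothetical, and the naming rule is only an illustration of the "Outputfile" step.

```python
import os


def output_name_for(filename, output_root):
    """Hypothetical naming rule: reuse the input file's base name under output_root."""
    base = os.path.splitext(os.path.basename(filename))[0]
    return output_root.rstrip("/") + "/" + base


def save_each_file(sc, decoded_by_file, output_root):
    """Write each file's rows to its own output directory.

    decoded_by_file is an iterable of (filename, list_of_lists) pairs that
    already lives on the driver; sc is an existing SparkContext.
    """
    for filename, rows in decoded_by_file:
        rdd = sc.parallelize(rows)  # list of lists -> RDD of rows
        lines = rdd.map(lambda row: ",".join(str(field) for field in row))
        lines.saveAsTextFile(output_name_for(filename, output_root))
```

The pairs could come from something like data.map(...).collect() with Func_1 changed to return (filename, decode_data), which is only reasonable when the decoded data fits in driver memory.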

Convert a Python list of lists to a Spark RDD

0 answers: