I am running a Spark job that works in the following steps:
First, it reads a directory of files:
data = sc.binaryFiles()
Then it processes each file separately:
res = data.map(lambda xy: func_1(xy[0], xy[1]))
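func_1 calls another function, func_2, which processes the content of each file and returns a list of lists to func_1. Now I need to turn this list of lists into a Spark RDD and write it to HDFS, but I don't know how to do that.
I am very new to Spark. Any help in this case would be much appreciated. Thanks in advance.
Edited: as suggested, here are the definitions of Func_1 and Func_2: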
def Func_1(filename, file_content):
    Outputfile = "some code for generating output file name for each input file"
    decode_data = Func_2(StringIO(file_content))
    ## save decode_data here in HDFS.

def Func_2(file_obj):
    ## Decodes the file sequentially (this is necessary because each binary file has headers attached to each portion of the file) and returns a list of lists, where each inner list is one row of the decoded data and the outer list is the collection of such rows (code skipped as it is trivial).
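
One direction that might fit the setup described above, as a rough sketch rather than a confirmed solution: instead of saving from inside the per-file function, return the decoded rows and use flatMap, so that every inner list becomes its own RDD element and Spark writes the result to HDFS. Everything named here is an assumption for illustration: `func_2` is a stand-in decoder (the real one is not shown in the question), `decode_rows`, `format_row` and `run` are hypothetical helpers, and `input_dir` / `output_dir` are placeholder paths. The sketch assumes Python 3, where `binaryFiles` yields the content as bytes, hence `BytesIO` instead of `StringIO`.

```python
from io import BytesIO


def func_2(stream):
    """Stand-in decoder (hypothetical): one row per line, fields split on commas."""
    return [line.decode("utf-8").split(",") for line in stream.read().splitlines()]


def decode_rows(pair):
    """Turn one (filename, content) pair into a list of (filename, row) records."""
    filename, file_content = pair
    return [(filename, row) for row in func_2(BytesIO(file_content))]


def format_row(record, sep="\t"):
    """Render one (filename, row) record as a single line of text."""
    filename, row = record
    return sep.join([filename] + [str(field) for field in row])


def run(sc, input_dir, output_dir):
    """Driver-side wiring; sc is an existing SparkContext, paths are placeholders."""
    data = sc.binaryFiles(input_dir)                  # RDD of (filename, bytes)
    rows = data.flatMap(decode_rows)                  # one element per decoded row
    rows.map(format_row).saveAsTextFile(output_dir)   # written out by the executors
```

Note that saveAsTextFile writes part files into a single output directory, not one output file per input file. Keeping the filename in each record, as above, is one way to retain where a row came from; whether that is enough depends on how the per-file output names are meant to be used.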