
How to identify the RDD source file name when processing an HDFS directory

Time: 2019-03-14 22:40:29

Tags: apache-spark pyspark

In Spark, you can process an HDFS directory with sc.textFile. How can I print the name of the file currently being processed together with its contents?

def get_data(x):
    return x  # I want this to return the source file name + line content

textFile = sc.textFile("hdfs://hadoop.localdomain/user/sw/pdf/")  # process the WHOLE directory

words_filter = textFile.map(get_data)
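
One possible direction, offered only as a sketch and not as a confirmed answer: sc.textFile yields bare lines with no record of their origin, whereas sc.wholeTextFiles yields one (file path, file content) pair per file. Splitting each pair into lines keeps the path next to every line. The helper names tag_lines and lines_with_source are hypothetical, and the sketch assumes each file is small enough to fit in memory as a single record, since wholeTextFiles reads every file whole.

```python
def tag_lines(path_and_content):
    """Turn one (file_path, file_content) pair into (file_path, line) pairs."""
    path, content = path_and_content
    return [(path, line) for line in content.splitlines()]


def lines_with_source(sc, directory):
    """Return an RDD of (file_path, line) pairs for every file in `directory`.

    `sc` is an existing SparkContext. wholeTextFiles produces one
    (path, content) record per file; flatMap then expands it per line.
    """
    return sc.wholeTextFiles(directory).flatMap(tag_lines)
```

In a PySpark shell this could be tried with lines_with_source(sc, "hdfs://hadoop.localdomain/user/sw/pdf/").take(10). For large files, the DataFrame API may be a better fit: spark.read.text(...) combined with pyspark.sql.functions.input_file_name() adds the source path as a column without loading whole files at once.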

0 answers:

No answers