Can I take multiple input files in pyspark without combining them into a single RDD?

Time: 2017-10-10 15:56:03

Tags: hadoop pyspark

In Hadoop, I can point the application at a path and the mappers will process the files individually. I have to process them this way because I need to parse the file name and path to match against other files that I load directly in the mapper.

In pyspark, passing a path to SparkContext's textFile creates a single RDD. Is there any way to replicate this Hadoop behavior in Spark / pyspark?

2 Answers:

Answer 0 (score: 1)

I hope this clears up some of your confusion: sparkContext.wholeTextFiles(path) returns a pairRDD (useful link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html)

In short, a pairRDD behaves more like a map, i.e. it holds (key, value) pairs.

rdd = sparkContext.wholeTextFiles(path)

def func_work_on_individual_files(x):
    # x is a (key, value) tuple for one element of the pairRDD:
    #   key   -> x[0], the file path
    #   value -> x[1], the file content, with lines separated by '\n' (as you mentioned)
    # Put your logic for doing something useful with the file data here;
    # to get separate lines you can use: x[1].split('\n')
    # End the function by returning whatever you want out of a file's data.

    # I am simply returning the whole content of the file
    return x[1]


# apply the function to each file in the pairRDD created above
file_contents = rdd.map(func_work_on_individual_files)

# this will collapse all elements into just one partition (as you mentioned)
consolidated_contents = file_contents.repartition(1)

# Save the final output - this will create just one output path like Hadoop;
# output_path must be a directory that does not already exist (don't reuse the input path)
consolidated_contents.saveAsTextFile(output_path)
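
Since the question mentions needing the file name and path, here is a small variant of the function above that also uses the key (x[0]); the os.path parsing and the returned (name, line count) pair are just illustrations, not part of the original answer:

import os

def func_with_file_name(x):
    # x[0] is the full file path, x[1] is the file content
    file_name = os.path.basename(x[0])
    lines = x[1].split('\n')
    # return whatever combination of file name and parsed data you need;
    # here I return the file name together with its line count
    return (file_name, len(lines))

file_summaries = rdd.map(func_with_file_name)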

Answer 1 (score: 0)

Pyspark provides a function for exactly this use case: sparkContext.wholeTextFiles(path). It reads a directory of text files and produces key-value pairs, where the key is the path of each file and the value is the content of that file.
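
A minimal sketch of that behaviour (the input directory and app name below are placeholders, not from the original answer):

from pyspark import SparkContext

sc = SparkContext(appName="whole_text_files_demo")

# One (file_path, file_content) pair per file under the directory
pairs = sc.wholeTextFiles("hdfs:///data/input_dir")

# For example, pull out just the file paths to see what was read
paths = pairs.keys().collect()
print(paths)

sc.stop()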