I want to create Parquet files that store each value at the path given by its key.
# key is the path where the value is to be written
rdd = sc.parallelize([('/user/dir_a','tableA'),('/user/dir_b','tableB'),('/user/dir_c','tableC')])
So under the path /user/dir_a, 'tableA' is written.
What I did:
def writeToHdfs(x):
    path = x[0]
    outputpath = OUT_DIR + path
    log.info('Creating dataframe')
    s = SparkSession(sc)
    df = s.createDataFrame(x[1], schema)
    df.write.parquet(outputpath)

rdd.foreach(writeToHdfs)
Thanks.
Answer 0 (score: 1)
I believe there is no out-of-the-box solution for this scenario. The code below is in Scala, but the logic is the same in Python.
val baseRDD = sc.parallelize(Seq(("/user/dir_a", "tableA"), ("/user/dir_b", "tableB"), ("/user/dir_c", "tableC"))).cache()

val groupedRDD = baseRDD.groupByKey()

// Bring the keys to the driver. collect() is a somewhat expensive
// operation, but we need the keys (paths) after all.
val keys = groupedRDD.keys.collect()

// Create an RDD specific to each of your paths
val rddList = keys.map { key =>
  val rdd = baseRDD.filter(f => f._1 == key)
  (key, rdd)
}

// Now you have a list of RDDs, one per path. Iterate over each RDD and save it to a file.
rddList.foreach(f => {
  val path = f._1
  f._2.values.saveAsTextFile(path)
})
Note: cache your RDDs wherever you think it is needed for performance. Replace saveAsTextFile(...) with your respective write method.
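Since the answer notes the logic is the same in Python, here is a minimal pure-Python sketch of that partition-by-key idea, with no Spark cluster required: group the values by their key, then write each group to its own path. The `write_by_key` helper and the directory handling are illustrative assumptions, not part of the original answer; in real PySpark the grouping and writing would be done with groupByKey, filter, and a DataFrame writer as shown above.

```python
import os
import tempfile
from collections import defaultdict

# Sample data mirroring the question's RDD: (path, value) pairs.
pairs = [('/user/dir_a', 'tableA'), ('/user/dir_b', 'tableB'), ('/user/dir_c', 'tableC')]

def write_by_key(pairs, out_dir):
    """Group values by their key (the target path) and write each group
    to its own file under out_dir, mirroring the groupByKey/filter idea."""
    groups = defaultdict(list)
    for path, value in pairs:
        groups[path].append(value)
    written = {}
    for path, values in groups.items():
        # Strip the leading '/' so the key nests under out_dir.
        target = os.path.join(out_dir, path.lstrip('/'))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, 'w') as fh:
            fh.write('\n'.join(values))
        written[path] = target
    return written

out_dir = tempfile.mkdtemp()
written = write_by_key(pairs, out_dir)
print(sorted(written))  # one output file per distinct key
```

The same shape carries over to Spark: collect the distinct keys on the driver, then produce one write per key.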