Write the values of an RDD to the path specified in its key

Date: 2017-03-24 20:27:24

Tags: apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql

I want to create Parquet files that store each value under the path specified by its key.

# key is the path where the value is to be written
rdd = sc.parallelize([('/user/dir_a','tableA'),('/user/dir_b','tableB'),('/user/dir_c','tableC')])

So under the path /user/dir_a, 'tableA' should be written.

What I have done:

def writeToHdfs(x):
  path = x[0]
  outputpath = OUT_DIR + path
  log.info('Creating dataframe')
  s = SparkSession(sc)
  df = s.createDataFrame(x[1], schema)
  df.write.parquet(outputpath)

rdd.foreach(writeToHdfs)

Thanks.

1 answer:

Answer 0: (score: 1)

I believe there is no out-of-the-box solution for this scenario. The code below is in Scala, but the logic is the same in Python.

val baseRDD = sc.parallelize(Seq(("/user/dir_a", "tableA"), ("/user/dir_b", "tableB"), ("/user/dir_c", "tableC"))).cache()

val groupedRDD = baseRDD.groupByKey()

// Bring the keys to the driver. This is a slightly expensive operation,
// but we need the keys (paths) after all.
val keys = groupedRDD.keys.collect()

// Create an RDD specific to each path.
val rddList = keys.map { key =>
  val rdd = baseRDD.filter(f => f._1 == key)
  (key, rdd)
}

// Now you have a list of RDDs, one per path. Iterate over each RDD and save it to a file.
rddList.foreach { case (path, rdd) =>
  rdd.values.saveAsTextFile(path)
}

Note: Cache your RDDs wherever you think it will help performance, and replace saveAsTextFile(...) with your respective write method (Parquet in your case).
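Since the logic is identical in Python, a minimal PySpark sketch of the same approach might look like the following. It assumes an existing SparkSession named spark and reuses OUT_DIR from your snippet; each group is written as a single-column DataFrame, so adapt the schema to your actual values.

# Minimal PySpark sketch (assumptions: `spark` is an existing SparkSession,
# OUT_DIR is the base output directory from the question).
rdd = spark.sparkContext.parallelize(
    [('/user/dir_a', 'tableA'), ('/user/dir_b', 'tableB'), ('/user/dir_c', 'tableC')]
).cache()

# Bring the distinct keys (paths) to the driver.
keys = rdd.keys().distinct().collect()

for key in keys:
    # Keep only the values belonging to this path.
    values = rdd.filter(lambda kv, k=key: kv[0] == k).values()
    # Build a single-column DataFrame and write it out as Parquet.
    df = values.map(lambda v: (v,)).toDF(['value'])
    df.write.parquet(OUT_DIR + key)

All writes happen on the driver, one pass per distinct path, which avoids creating a SparkSession inside rdd.foreach; that does not work on the executors.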