I want to create Parquet files that store each value at the path given by its key.
# key is the path where the value is to be written
rdd = sc.parallelize([('/user/dir_a','tableA'),('/user/dir_b','tableB'),('/user/dir_c','tableC')])
So under the path /user/dir_a, 'tableA' is written.
What I did:
def writeToHdfs(x):
    path = x[0]
    outputpath = OUT_DIR + path
    log.info('Creating dataframe')
    s = SparkSession(sc)
    df = s.createDataFrame(x[1], schema)
    df.write.parquet(outputpath)

rdd.foreach(writeToHdfs)
Thanks.
Answer 0 (score: 1)
I believe there is no out-of-the-box solution for this scenario. The code below is in Scala, but the logic is the same in Python.
val baseRDD = sc.parallelize(Seq(("/user/dir_a", "tableA"), ("/user/dir_b", "tableB"), ("/user/dir_c", "tableC"))).cache()

val groupedRDD = baseRDD.groupByKey()

// Bring the keys to the driver. collect() is a somewhat expensive
// operation, but we need the keys (paths) after all.
val keys = groupedRDD.keys.collect()

// Create an RDD specific to each of your paths
val rddList = keys.map { key =>
  val rdd = baseRDD.filter(f => f._1 == key)
  (key, rdd)
}

// Now you have a list of RDDs, one per path. Iterate over each RDD and save it to a file.
rddList.foreach(f => {
  val path = f._1
  f._2.values.saveAsTextFile(path)
})
Note: cache your RDDs wherever you think it is needed for performance. Replace saveAsTextFile(...) with your respective write method.
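Since the answer notes the logic is the same in Python, here is a minimal pure-Python sketch of that partition-by-key idea, with no Spark cluster required: group the values by their key, then write each group to its own path. The `write_by_key` helper and the directory handling are illustrative assumptions, not part of the original answer; in real PySpark the grouping and writing would be done with groupByKey, filter, and a DataFrame writer as shown above.

```python
import os
import tempfile
from collections import defaultdict

# Sample data mirroring the question's RDD: (path, value) pairs.
pairs = [('/user/dir_a', 'tableA'), ('/user/dir_b', 'tableB'), ('/user/dir_c', 'tableC')]

def write_by_key(pairs, out_dir):
    """Group values by their key (the target path) and write each group
    to its own file under out_dir, mirroring the groupByKey/filter idea."""
    groups = defaultdict(list)
    for path, value in pairs:
        groups[path].append(value)
    written = {}
    for path, values in groups.items():
        # Strip the leading '/' so the key nests under out_dir.
        target = os.path.join(out_dir, path.lstrip('/'))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, 'w') as fh:
            fh.write('\n'.join(values))
        written[path] = target
    return written

out_dir = tempfile.mkdtemp()
written = write_by_key(pairs, out_dir)
print(sorted(written))  # one output file per distinct key
```

The same shape carries over to Spark: collect the distinct keys on the driver, then produce one write per key.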