Writing in parallel in Spark

Time: 2017-07-14 06:44:01

Tags: apache-spark apache-spark-sql spark-dataframe

I am trying to write data to Azure blob storage by splitting it into several parts, so that each part can be written to a different Azure blob storage account. I can see that the loop below runs sequentially. Is there a way to parallelize the writes?

    val accounts = Array("acct1", "acct2", "acct3", "acct4")

    // Split the DataFrame into four roughly equal parts
    val numSplits = Array.fill(4)(0.25)
    val splitDf = df.randomSplit(numSplits)

    var batchCt = 0

    splitDf.foreach { ds =>
        val acct = accounts(batchCt)
        val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
        val outputFile = String.format(outputFolder, currentTime)
        ds.write.json(outputFile) // each write runs as a separate, blocking Spark job
        batchCt = batchCt + 1
    }
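For reference, one common way to parallelize such a driver-side loop (a sketch, not from the original post) is to submit each write from its own thread, since the Spark scheduler can run jobs submitted from separate driver threads concurrently:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Sketch: each write becomes a Future, so the four Spark jobs are
    // submitted from separate driver threads and can run concurrently.
    // Reuses splitDf, accounts and currentTime from the snippet above.
    val writes = splitDf.zip(accounts).map { case (ds, acct) =>
        Future {
            val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
            ds.write.json(String.format(outputFolder, currentTime))
        }
    }
    Await.result(Future.sequence(writes.toSeq), Duration.Inf)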

1 Answer:

Answer 0: (score: 0)

You can use mapPartitionsWithIndex to achieve your goal. The code would look something like this (I have not tried it with DataFrames, only with RDDs, but the two can be freely converted into each other):

    val accounts = Array("acct1", "acct2", "acct3", "acct4")

    val rdd = sc.parallelize(Array.fill(4)(1)) // dummy data
    // We create 4 partitions to write in 4 parallel streams
    // (assuming you have 4 executors)
    val splitRdd = rdd.repartition(4).mapPartitionsWithIndex {
        case (ind, vals) =>
            // Here we use the partition number to pick the account
            val acct = accounts(ind)
            val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
            vals.foreach { v =>
                // ...
                // do the write of value v
            }
            // the partition function must return an Iterator
            Iterator.empty
    }
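One caveat worth adding: mapPartitionsWithIndex is a transformation and therefore lazy, so nothing is written until an action runs, for example:

    // An action is needed to actually trigger the per-partition writes
    splitRdd.count()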

Note that, depending on how .repartition actually executes, it is easy to end up with an uneven distribution of data across the partitions.
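A quick way to check how even the distribution actually is (a sketch, assuming the same rdd as above) is to count the elements in each partition:

    // Count the elements per partition to spot skew after repartition
    rdd.repartition(4)
        .mapPartitionsWithIndex { case (ind, vals) => Iterator((ind, vals.size)) }
        .collect()
        .foreach { case (ind, n) => println(s"partition $ind: $n elements") }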