Writing in parallel in Spark

Time: 2017-07-14 06:44:01

Tags: apache-spark apache-spark-sql spark-dataframe

I am trying to write data to Azure blob storage by splitting it into several parts, so that each part can be written to a different Azure blob storage account. I can see that the loop below runs sequentially. Is there a way to parallelize the writes?

    val accounts = Array("acct1", "acct2", "acct3", "acct4")

    // Split the DataFrame into four roughly equal parts
    val numSplits = Array.fill(4)(0.25)
    val splitDf = df.randomSplit(numSplits)

    var batchCt = 0

    splitDf.foreach { ds =>
        val acct = accounts(batchCt)
        val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
        val outputFile = String.format(outputFolder, currentTime)
        ds.write.json(outputFile) // each write runs as a separate, blocking Spark job
        batchCt = batchCt + 1
    }
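For reference, one common way to parallelize such a driver-side loop (a sketch, not from the original post) is to submit each write from its own thread, since the Spark scheduler can run jobs submitted from separate driver threads concurrently:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Sketch: each write becomes a Future, so the four Spark jobs are
    // submitted from separate driver threads and can run concurrently.
    // Reuses splitDf, accounts and currentTime from the snippet above.
    val writes = splitDf.zip(accounts).map { case (ds, acct) =>
        Future {
            val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
            ds.write.json(String.format(outputFolder, currentTime))
        }
    }
    Await.result(Future.sequence(writes.toSeq), Duration.Inf)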

1 Answer:

Answer 0: (score: 0)

You can use mapPartitionsWithIndex to achieve your goal. The code would look something like this (I have not tried it with DataFrames, only with RDDs, but the two can be freely converted into each other):

    val accounts = Array("acct1", "acct2", "acct3", "acct4")

    val rdd = sc.parallelize(Array.fill(4)(1)) // dummy data
    // We create 4 partitions to write in 4 parallel streams
    // (assuming you have 4 executors)
    val splitRdd = rdd.repartition(4).mapPartitionsWithIndex {
        case (ind, vals) =>
            // Here we use the partition number to pick the account
            val acct = accounts(ind)
            val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
            vals.foreach { v =>
                // ...
                // do the write of value v
            }
            // the partition function must return an Iterator
            Iterator.empty
    }
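One caveat worth adding: mapPartitionsWithIndex is a transformation and therefore lazy, so nothing is written until an action runs, for example:

    // An action is needed to actually trigger the per-partition writes
    splitRdd.count()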

Note that, depending on how .repartition actually executes, it is easy to end up with an uneven distribution of data across the partitions.
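A quick way to check how even the distribution actually is (a sketch, assuming the same rdd as above) is to count the elements in each partition:

    // Count the elements per partition to spot skew after repartition
    rdd.repartition(4)
        .mapPartitionsWithIndex { case (ind, vals) => Iterator((ind, vals.size)) }
        .collect()
        .foreach { case (ind, n) => println(s"partition $ind: $n elements") }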