I am trying to write data to Azure Blob Storage by splitting it into multiple parts, so that each part can be written to a different Azure Blob Storage account. I see that the loop below runs sequentially. Is there a way to parallelize the writes?
val accounts = Array("acct1", "acct2", "acct3", "acct4")
val numSplits = Array.fill(4)(0.25)
val splitDf = df.randomSplit(numSplits)
var batchCt = 0 // must be a var, since it is reassigned below
splitDf.foreach { ds =>
  val acct = accounts(batchCt)
  val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
  val outputFile = String.format(outputFolder, currentTime)
  ds.write.json(outputFile)
  batchCt = batchCt + 1
}
Answer 0 (score: 0)
You can use mapPartitionsWithIndex to achieve this. The code would look something like the following (I have not tried it with DataFrames, only with RDDs, but the two can be freely converted back and forth):
val accounts = Array("acct1", "acct2", "acct3", "acct4")
val rdd = sc.parallelize(Array.fill(4)(1)) // dummy data

// We create 4 partitions to write in 4 parallel streams
// (assuming you have 4 executors)
val splitRdd = rdd.repartition(4).mapPartitionsWithIndex {
  case (ind, vals) =>
    // Here we use the partition number to pick the account
    val acct = accounts(ind)
    val outputFolder = "wasb://test@" + acct + ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
    vals.foreach { v =>
      // ...
      // do the write of value v
    }
    Iterator.empty // the function passed to mapPartitionsWithIndex must return an Iterator
}
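The outputFolder string above is a java.util.Formatter date/time template, filled in via String.format. A minimal self-contained sketch of the index-to-account path construction (no Spark or Azure needed; the account names and the fixed timestamp are placeholders for illustration):

```scala
import java.util.Calendar

object PartitionWriteSketch {
  // Placeholder account names, as in the answer above
  val accounts = Array("acct1", "acct2", "acct3", "acct4")

  // Build the output folder for a given partition index and timestamp,
  // mirroring the String.format template used in the question
  def outputFolder(partitionIndex: Int, time: Calendar): String = {
    val acct = accounts(partitionIndex)
    val template = "wasb://test@" + acct +
      ".blob.core.windows.net/json/hourly/%1$tY/%1$tm/%1$td/%1$tH/"
    String.format(template, time)
  }

  def main(args: Array[String]): Unit = {
    val time = Calendar.getInstance()
    time.set(2024, Calendar.JANUARY, 2, 3, 0)
    // One folder per partition index, each pointing at a different account
    (0 until 4).foreach { i => println(outputFolder(i, time)) }
  }
}
```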
Note that, depending on how .repartition is actually executed, it is easy to end up with an uneven distribution of data across the partitions.
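An alternative that keeps the original randomSplit loop is to launch each write from the driver as a Scala Future, so the four Spark write jobs are submitted concurrently instead of one after another. A minimal sketch with the actual blob write simulated (the account names and writePart body are placeholders; in the real job writePart would call ds.write.json(path)):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelWriteSketch {
  // Placeholder account names, as in the question
  val accounts = Array("acct1", "acct2", "acct3", "acct4")

  // Stand-in for the real write; in the actual job this would be
  // ds.write.json(outputFile) for the split belonging to this account
  def writePart(acct: String): String =
    s"wrote to $acct"

  // Submit one write per account concurrently, then wait for all of them
  def writeAll(): Seq[String] = {
    val futures = accounts.toSeq.map(acct => Future(writePart(acct)))
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

Because each ds.write.json is a blocking action, wrapping it in a Future lets the scheduler run the four jobs in parallel, provided the cluster has enough free executors.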