I'm writing a Scala program to read objects matching a certain prefix on S3.
Right now I'm testing on a MacBook Pro, and it takes 270 ms (averaged over 1,000 trials) to hit S3, retrieve 10 objects (average size 150 KB), and print out the result.
Here is my code:
import scala.collection.JavaConverters._

val myBucket = "my-test-bucket"
val myPrefix = "t"

val startTime = System.currentTimeMillis()

// Can I make listObjects parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest)
  .getObjectSummaries
  .asScala // getObjectSummaries returns a java.util.List, so convert it first
  .par
  .toIndexedSeq
  .map(_.getKey)
  .filter(_ matches s"./.*${myPrefix}.*/*") // note: this filtering happens client side

// Can I make foreach parallel?
listObjResult foreach println // Could be any function

println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the grand scheme of things, I have to sift through 50 GB of data (roughly 350K nested objects) and delete the objects that follow a particular prefix (roughly 40K objects).
Hardware considerations aside, what can I do to optimize my code?
Thanks!
Answer 0 (score: 0):
One possible approach is to batch the objects and send bulk delete requests to S3. You can group the objects you want to delete into batches and then parallelize the work over those batches (here with Futures):
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import scala.collection.JavaConverters._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{blocking, Future}
import scala.util.Try

object AmazonBatchDeletion {

  // The v1 SDK client is thread-safe; share one instance across all batches
  // instead of constructing a new client per request.
  val amazonClient = new AmazonS3Client()

  def main(args: Array[String]): Unit = {
    val filesToDelete: List[String] = ???
    val batchSize: Int = ??? // grouped(n) takes the size of each batch; S3 accepts at most 1000 keys per bulk delete

    val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
      filesToDelete
        .grouped(batchSize)
        .map(groupToDelete => Future {
          blocking { // the SDK call blocks on network I/O, so mark it for the execution context
            deleteFilesInBatch(groupToDelete, "bucketName")
          }
        })

    val result: Future[Iterator[Try[DeleteObjectsResult]]] =
      Future.sequence(deletionAttempts)

    // TODO: make sure the deletions were successful.
    // Recover from failed futures if needed.
  }
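
  // One possible way to consume `result` (a minimal, illustrative sketch,
  // not prescribed by the answer above): block until all batches complete,
  // then count and log the failures. Adjust the timeout to the workload.
  def reportResults(result: Future[Iterator[Try[DeleteObjectsResult]]]): Unit = {
    import scala.concurrent.Await
    import scala.concurrent.duration._
    val outcomes = Await.result(result, 10.minutes).toList
    val (succeeded, failed) = outcomes.partition(_.isSuccess)
    println(s"${succeeded.size} batches deleted, ${failed.size} batches failed")
    failed.foreach(_.failed.foreach(e => println(s"Batch failed: ${e.getMessage}")))
  }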
  def deleteFilesInBatch(filesToDelete: List[String],
                         bucketName: String): Try[DeleteObjectsResult] = {
    // One DeleteObjects request removes the whole batch of keys in a single round trip.
    val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
    deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
    Try {
      amazonClient.deleteObjects(deleteObjectsRequest)
    }
  }
}
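
For the actual workload in the question (~40K objects under a prefix out of ~350K), the list of keys should come from S3 itself rather than from a client-side regex: ListObjectsRequest.withPrefix makes S3 filter server side, and pagination is required because each response carries at most 1000 keys. Below is a minimal sketch of how filesToDelete could be produced, using a hypothetical helper listKeysWithPrefix; the bucket name and prefix are the question's placeholders:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import scala.annotation.tailrec
import scala.collection.JavaConverters._

// Hypothetical helper: collect every key under `prefix`, following pagination.
def listKeysWithPrefix(s3: AmazonS3Client, bucket: String, prefix: String): List[String] = {
  @tailrec
  def loop(listing: ObjectListing, acc: List[String]): List[String] = {
    val keys = acc ++ listing.getObjectSummaries.asScala.map(_.getKey)
    // isTruncated signals that more pages remain; fetch the next batch.
    if (listing.isTruncated) loop(s3.listNextBatchOfObjects(listing), keys)
    else keys
  }
  val request = new ListObjectsRequest().withBucketName(bucket).withPrefix(prefix)
  loop(s3.listObjects(request), Nil)
}

val filesToDelete: List[String] = listKeysWithPrefix(new AmazonS3Client(), "my-test-bucket", "t")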