Accessing S3 objects in parallel in Scala

Time: 2016-11-18 05:25:31

Tags: scala amazon-web-services amazon-s3

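I am writing a Scala program that reads the objects on S3 matching a certain prefix.

At the moment I am testing on a MacBook Pro, and it takes 270 ms (averaged over 1000 trials) to hit S3, retrieve 10 objects (average size 150 KB), and print the output.

Here is my code: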

val myBucket = "my-test-bucket"
val myPrefix = "t"
val startTime = System.currentTimeMillis() 

//Can I make listObject parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest).getObjectSummaries.par.toIndexedSeq.map(_.getKey).filter(_ matches s"./.*${myPrefix}.*/*")

//Can I make forEach parallel? 
listObjResult foreach println //Could be any function

println(s"Total time: ${System.currentTimeMillis() - startTime}ms")

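In the bigger picture, I have to sift through 50 GB of data (about 350K nested objects) and delete the objects that follow a specific prefix (about 40K objects).

Hardware considerations aside, what can I do to optimize my code?

Thanks!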

1 Answer:

Answer 0 (score: 0)

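One possible solution is to batch the keys and send multi-object delete requests to S3. You can split the objects to delete into groups and then process the groups in parallel, here with one Future per group: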

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}

import scala.collection.JavaConverters._
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object AmazonBatchDeletion {
  def main(args: Array[String]): Unit = {
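    val filesToDelete: List[String] = ???
    // grouped(n) takes the size of each group, not the number of groups.
    // S3 accepts at most 1000 keys per multi-object delete request.
    val batchSize: Int = ???

    val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
      filesToDelete
       .grouped(batchSize)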
       .map(groupToDelete => Future {
          blocking {
            deleteFilesInBatch(groupToDelete, "bucketName")  
          }
        })

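    val result: Future[Iterator[Try[DeleteObjectsResult]]] =
      Future.sequence(deletionAttempts)

    // The global ExecutionContext runs on daemon threads, so block here;
    // otherwise the JVM may exit before the deletions have run.
    val outcomes: List[Try[DeleteObjectsResult]] =
      Await.result(result, scala.concurrent.duration.Duration.Inf).toList

    // TODO: make sure deletion was successful.
    // Recover if needed from faulted futures.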
  }

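  // The SDK client is thread-safe, so create it once and share it across
  // batches instead of constructing a new one per call.
  private val amazonClient = new AmazonS3Client()

  def deleteFilesInBatch(filesToDelete: List[String],
                         bucketName: String): Try[DeleteObjectsResult] = {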

    val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
    deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)

    Try {
      amazonClient.deleteObjects(deleteObjectsRequest)
    }
  }
}
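
To get a feel for the batching arithmetic without touching S3, here is a minimal sketch that needs only the standard library. It assumes the 40K keys from the question and the 1000-keys-per-request cap of S3's multi-object delete. `deleteBatch`, the `t/object-N` key names, and the `bad-key` marker are hypothetical stand-ins I made up for illustration; in real code `deleteBatch` would be replaced by the `deleteFilesInBatch` call above.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Success, Try}

object BatchDeletionSketch {
  // S3's multi-object delete accepts at most 1000 keys per request.
  val MaxKeysPerRequest = 1000

  // Hypothetical stand-in for the real S3 call: reports how many keys it
  // "deleted" and fails for any batch containing the made-up key "bad-key".
  def deleteBatch(keys: List[String]): Try[Int] = Try {
    require(!keys.contains("bad-key"), "simulated S3 failure")
    keys.size
  }

  // Starts one Future per batch and waits for all of them to finish.
  def deleteAll(keys: List[String],
                batchSize: Int = MaxKeysPerRequest): List[Try[Int]] = {
    val attempts: List[Future[Try[Int]]] =
      keys.grouped(batchSize).toList
        .map(batch => Future(blocking(deleteBatch(batch))))
    Await.result(Future.sequence(attempts), 1.minute)
  }

  def main(args: Array[String]): Unit = {
    // Made-up keys standing in for the ~40K objects to delete.
    val keys = (1 to 40000).map(i => s"t/object-$i").toList
    val results = deleteAll(keys)
    val deleted = results.collect { case Success(n) => n }.sum
    val failed = results.count(_.isFailure)
    // prints "batches=40 deleted=40000 failed=0"
    println(s"batches=${results.size} deleted=$deleted failed=$failed")
  }
}
```

With 1000 keys per request, the 40K deletions collapse into 40 round trips, and because every batch comes back as a `Try`, a failed batch can be retried on its own without repeating the rest.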