Question

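I have an S3 bucket named "Source". Many '.tgz' files are being pushed into that bucket in real time. I wrote Java code to extract each '.tgz' archive and push its contents into a "Destination" bucket, and I deployed that code as a Lambda function. In my Java code I get the '.tgz' file as an InputStream. How do I extract it in Lambda? I can't create a file in Lambda; Java throws "FileNotFound (Permission Denied)".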

AmazonS3 s3Client = new AmazonS3Client();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));
InputStream objectData = s3Object.getObjectContent();
File file = new File(s3Object.getKey());
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file)); // throws FileNotFound (Permission denied) here

Answer 1

import os
import tarfile

import boto3

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    new_bucket = 'uncompressed-data'  # destination bucket name
    prefix = key[:-4]  # archive key without the '.tgz' suffix
    try:
        # /tmp is the writable location inside Lambda
        s3_client.download_file(bucket, key, '/tmp/file')
        if tarfile.is_tarfile('/tmp/file'):
            with tarfile.open('/tmp/file', 'r:gz') as tar:
                for member in tar:
                    if not member.isfile():
                        continue  # skip directories, links, etc.
                    tar.extract(member, path='/tmp/extract/')
                    # upload every extracted file, not only the last one
                    s3_client.upload_file(os.path.join('/tmp/extract', member.name),
                                          new_bucket, prefix + '/' + member.name)
    except Exception as e:
        print(e)
        raise

Use Python 3.6 and trigger the function on the ObjectCreated (All) event for objects with the suffix ".tgz". Hope this helps. Take a look at this link.
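One detail the handler above depends on: object keys in S3 event notifications arrive URL-encoded (a space is delivered as `+`), so keys containing spaces or special characters need decoding before they are passed to `download_file`. A minimal sketch, assuming the standard S3 event layout used above; the helper names `parse_s3_event` and `strip_tgz_suffix` are hypothetical and not part of the original answer:

```python
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) for the first record of an S3 event notification."""
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    # Keys are URL-encoded in the notification, e.g. 'my+file.tgz' for 'my file.tgz'
    key = unquote_plus(record['object']['key'])
    return bucket, key


def strip_tgz_suffix(key):
    """Drop a trailing '.tgz' or '.tar.gz'; return the key unchanged otherwise."""
    for suffix in ('.tgz', '.tar.gz'):
        if key.endswith(suffix):
            return key[:-len(suffix)]
    return key
```

Checking the suffix explicitly avoids silently chopping four characters off a key that does not end in ".tgz", which is what a bare `key[:-4]` would do.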

Answer 2

Don't use File or FileOutputStream; use s3Client.putObject() instead. To read the tgz file you can use Apache Commons Compress. For example:

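import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.ArchiveInputStream;
import org.apache.commons.compress.archivers.ArchiveStreamFactory;
import org.apache.commons.compress.utils.IOUtils;
import com.amazonaws.services.s3.model.ObjectMetadata;

// objectData is the InputStream obtained from s3Object.getObjectContent()
ArchiveInputStream tar = new ArchiveStreamFactory().
    createArchiveInputStream("tar", new GZIPInputStream(objectData));
ArchiveEntry entry;
while ((entry = tar.getNextEntry()) != null) {
    if (!entry.isDirectory()) {
        // getSize() returns a long; this cast assumes each entry is smaller than 2 GB
        byte[] objectBytes = new byte[(int) entry.getSize()];
        // a single read() may return fewer bytes than requested, so read until full
        IOUtils.readFully(tar, objectBytes);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(objectBytes.length);
        metadata.setContentType("application/octet-stream");
        s3Client.putObject(destBucket, entry.getName(),
            new ByteArrayInputStream(objectBytes), metadata);
    }
}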
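
The same stream-only idea can be sketched in Python (the language of the other answers) with the standard `tarfile` module. Opening the archive in stream mode (`r|gz`) reads it sequentially from any object with a `read()` method, so nothing is written to disk and the compressed archive is never held in memory in full; each extracted entry still is. The helper names `iter_tar_files` and `build_sample_tgz` are hypothetical, and passing the `Body` of a boto3 `get_object` response as `stream` is an assumption that is not exercised here:

```python
import io
import tarfile


def iter_tar_files(stream):
    """Yield (name, content_bytes) for each regular file in a gzipped tar stream."""
    # 'r|gz' = sequential stream mode: no seeking, members must be read in order
    with tarfile.open(fileobj=stream, mode='r|gz') as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


def build_sample_tgz(files):
    """Build a small .tgz in memory from {name: bytes}; for local testing only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    buffer.seek(0)
    return buffer
```

Each yielded pair can then be handed to an upload call, so only one archive entry is in memory at a time.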

Answer 3

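Since one of the responses was given in Python, I am providing an alternative solution in that language.

The problem with the solution that uses the /tmp file system is that AWS only allows 512 MB to be stored there at the time of writing (read more). To untar or unzip larger files, it is better to use the io package and its BytesIO class and process the file contents purely in memory. AWS allows up to 3 GB of RAM to be allocated to a Lambda, which considerably raises the maximum file size. I have successfully tested untarring a 1 GB S3 file.

In my case, untarring 2000 files from a 1 GB tar file into another S3 bucket took 140 seconds. This can be optimized further by using multiple threads to upload the extracted files to the target S3 bucket.

The sample code below provides a single-threaded solution: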

import tarfile
from io import BytesIO

import boto3

s3_client = boto3.client('s3')

def untar_s3_file(event, context):

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Read the whole archive into memory instead of writing it to /tmp
    input_tar_file = s3_client.get_object(Bucket=bucket, Key=key)
    input_tar_content = input_tar_file['Body'].read()

    with tarfile.open(fileobj=BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                # As written this uploads into the source bucket; pass a different
                # Bucket here to write the extracted files somewhere else
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)
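
The multi-threaded upload mentioned above could be sketched as follows. This is not the author's code; it is a minimal sketch that assumes the extracted files fit in memory as `(key, bytes)` pairs and that `s3_client` is a boto3 S3 client (boto3 documents its low-level clients as thread safe). The function name `upload_in_parallel` and the `max_workers` default are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO


def upload_in_parallel(s3_client, bucket, files, max_workers=8):
    """Upload (key, content_bytes) pairs to `bucket` using a thread pool.

    Returns the uploaded keys in input order; re-raises the first upload error.
    """
    def upload_one(item):
        key, content = item
        s3_client.upload_fileobj(BytesIO(content), bucket, key)
        return key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and propagates worker exceptions
        return list(pool.map(upload_one, files))
```

Uploads are network-bound, so threads can overlap the waiting time even under the GIL; how much this helps in a given Lambda depends on its memory and network allocation and would need measuring.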

AWS Lambda: How to extract a tgz file in an S3 bucket and put it in another S3 bucket
