So I want to read a large CSV file from an S3 bucket, but I don't want the whole file to be downloaded into memory. What I want to do is somehow stream the file in chunks and then process those chunks.
This is what I have done so far, but I don't think it will solve the problem.
import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

s3 = boto3.client('s3')


def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if (i % chunksize == 0 and i > 0):
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)


def process_chunk(chunk):
    print(len(chunk))
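To make it concrete, this is roughly the streaming pattern I am aiming for (a rough sketch only, assuming the file is UTF-8 encoded, with process_chunk left as a placeholder): wrap boto3's StreamingBody so the csv module can consume it line by line without pulling the whole object down first.

import codecs
import csv

import boto3

s3 = boto3.client('s3')


def process_chunk(rows):
    # placeholder: replace with the real per-chunk processing
    print(len(rows))


def stream_csv(bucket_name, file_key, chunksize=1000):
    obj = s3.get_object(Bucket=bucket_name, Key=file_key)
    # codecs.getreader decodes the byte stream lazily, so rows are pulled
    # from S3 as the reader advances rather than all at once
    reader = csv.reader(codecs.getreader('utf-8')(obj['Body']))
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunksize:
            process_chunk(chunk)
            chunk = []
    if chunk:
        # flush the final partial chunk
        process_chunk(chunk)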
Answer 0 (score: 0)
This will do what you are trying to achieve. It does not download the whole file into memory; it downloads it in chunks, processes them, and continues:
from smart_open import smart_open
import csv


def get_s3_file_stream(s3_path):
    """
    This function will return a stream of the s3 file.
    The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
    """
    # This is the full path with credentials
    # (aws_access_key_id and aws_secret_access_key are assumed to be defined elsewhere):
    complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
    return smart_open(complete_s3_path, encoding='utf8')


def download_and_process_csv(s3_path):
    datareader = csv.DictReader(get_s3_file_stream(s3_path))
    for row in datareader:
        yield process_csv(row)  # write a function to do whatever you want to do with the CSV
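For example, you could consume the generator like this (the bucket/key path below is just a placeholder, and process_csv is whatever per-row function you write):

# Hypothetical usage; 'my-bucket/data/large_file.csv' is a placeholder path
for result in download_and_process_csv('my-bucket/data/large_file.csv'):
    print(result)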
Answer 1 (score: 0)
Have you tried AWS Athena (https://aws.amazon.com/athena/)? It is serverless and pay-per-query, which is very convenient. It can do everything you want without downloading the file. BlazingSql is open source and also useful when you run into big-data problems.
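As a rough illustration only (the database, table, and output bucket names below are placeholders, and a table is assumed to already be defined over the CSV data in S3), a query could be run with boto3's Athena client like this:

import time

import boto3

athena = boto3.client('athena')

# Placeholder names: adjust the query, database, and output location to your setup.
response = athena.start_query_execution(
    QueryString='SELECT col_a, col_b FROM my_table WHERE col_a > 100',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'},
)
query_id = response['QueryExecutionId']

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Page through the results without ever downloading the raw CSV.
if state == 'SUCCEEDED':
    paginator = athena.get_paginator('get_query_results')
    for page in paginator.paginate(QueryExecutionId=query_id):
        for row in page['ResultSet']['Rows']:
            print(row)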