I have thousands of objects stored in S3. My requirements call for loading a subset of those objects (anywhere between 5 and ~3000 of them) and reading each object's binary content. From reading the boto3 / AWS CLI docs it looks like it is not possible to fetch multiple objects in a single request, so currently I have this implemented as a loop that constructs the key of each object, requests the object, and then reads the object's body:
for column_key in outstanding_column_keys:
    try:
        s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
        data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
        metadata_dict = data_object["Metadata"]
        metadata_dict["key"] = column_key
        metadata_dict["version"] = float(metadata_dict["version"])
        metadata_dict["data"] = data_object["Body"].read()
        records.append(Record(metadata_dict))
    except Exception as exc:
        logger.info(exc)

if len(records) < len(column_keys):
    raise Exception("Some objects are missing!")
My issue is that when I try to fetch multiple objects (5 objects, say), I get back only 3, and the check for whether all objects have loaded finds some unprocessed, which I handle with a custom exception. A solution I came up with is to wrap the snippet above in a while loop, since I know which keys are still outstanding:
load_attempts = 0
while (len(outstanding_column_keys) > 0) and (load_attempts < 10):
    load_attempts += 1
    # Iterate over a copy so keys can be removed as their objects load.
    for column_key in list(outstanding_column_keys):
        try:
            s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
            data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
            metadata_dict = data_object["Metadata"]
            metadata_dict["key"] = column_key
            metadata_dict["version"] = float(metadata_dict["version"])
            metadata_dict["data"] = data_object["Body"].read()
            records.append(Record(metadata_dict))
            outstanding_column_keys.remove(column_key)
        except Exception as exc:
            logger.info(exc)

if len(records) < len(column_keys):
    raise Exception("Some objects are missing!")
But I suspect that S3 is actually still processing the outstanding responses, and that the while loop would make unnecessary additional requests for objects that S3 is already in the process of returning.
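In the meantime, one way I could see why objects come back missing is to catch botocore's ClientError instead of a bare Exception and log the error code S3 returns (e.g. NoSuchKey when a key does not exist). A sketch, using the same variables as the snippets above:

from botocore.exceptions import ClientError

try:
    data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
except ClientError as exc:
    # The error code distinguishes a missing key from throttling, permissions, etc.
    error_code = exc.response["Error"]["Code"]
    logger.error("get_object failed for %s: %s", s3_object_key, error_code)
    raise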
I did a separate investigation to verify whether get_object requests are synchronous, and it looks like they are:
import boto3
import time
import os

s3_client = boto3.client('s3',
                         aws_access_key_id=os.environ["S3_AWS_ACCESS_KEY_ID"],
                         aws_secret_access_key=os.environ["S3_AWS_SECRET_ACCESS_KEY"])

print "Saving 3000 objects to S3..."
start = time.time()
for x in xrange(3000):
    key = "greeting_{}".format(x)
    s3_client.put_object(Body="HelloWorld!", Bucket='bucket_name', Key=key)
end = time.time()
print "Done saving 3000 objects to S3 in %s" % (end - start)

print "Sleeping for 20 seconds before trying to load the saved objects..."
time.sleep(20)

print "Loading the saved objects..."
arr = []
start_load = time.time()
for x in xrange(3000):
    key = "greeting_{}".format(x)
    try:
        obj = s3_client.get_object(Bucket='bucket_name', Key=key)
        arr.append(obj)
    except Exception as exc:
        print exc
end_load = time.time()
print "Done loading the saved objects. Found %s objects. Time taken - %s" % (len(arr), end_load - start_load)
My question, and the thing I need confirmed, is: are get_object requests indeed synchronous? If they are, then I would expect all of the objects to have been returned by the time I check for the loaded objects in the first snippet. If get_object requests are asynchronous, how should I handle the responses in a way that avoids making extra requests to S3 for objects that are still on their way back? Thanks!
Answer 0 (score: 0)
Unlike JavaScript, Python processes requests synchronously unless you do some kind of multithreading (which you are not doing in your snippets). In your for loop you issue a request to s3_client.get_object, and that call blocks until the data is returned. Since the records array ends up smaller than it should be, that must mean some exception is being raised, and it should be caught in the except block:
except Exception as exc:
    logger.info(exc)
If nothing is printed there, it is likely because logging is configured to ignore INFO-level messages. If you are not seeing any errors, try printing them with logger.error instead.
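As a minimal, self-contained illustration of why that happens (the RuntimeError here just stands in for whatever get_object raises):

import logging

# The root logger defaults to the WARNING level, so INFO messages
# are dropped unless the level is lowered explicitly.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    raise RuntimeError("simulated get_object failure")
except Exception as exc:
    logger.info(exc)                  # visible only because of basicConfig above
    logger.error(exc)                 # visible even at the default WARNING level
    logger.exception("fetch failed")  # like error(), but also logs the traceback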
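And if you do eventually want several requests in flight at once, boto3 clients are thread-safe, so a thread pool is the usual route. A rough sketch using the standard library's concurrent.futures (part of the stdlib in Python 3; the futures backport provides it on Python 2) — fetch_one and fetch_many are illustrative names, not boto3 APIs:

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(s3_client, bucket, key):
    # Each get_object call still blocks, but the calls run on separate threads.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()

def fetch_many(s3_client, bucket, keys, max_workers=20):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, s3_client, bucket, k): k for k in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                errors[key] = exc  # keep failures visible instead of losing them
    return results, errors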