In a DynamoDB table with 10 million records, each item has an "epoch" timestamp attribute, and I am trying to count the items whose epoch falls within a given range. The table's provisioned read capacity is 1,000 units, and each item is 5–7 KB.
Code:
from boto3.session import Session
from boto3.dynamodb.conditions import Attr

START_EPOCH = 1443657600000
END_EPOCH = 1443744000000
TOTAL_ITEMS_TO_SCAN = 1000000

F_EXP = Attr('epoch').gt(START_EPOCH) & Attr('epoch').lt(END_EPOCH)

session = Session(aws_access_key_id='access_key',
                  aws_secret_access_key='secret_key',
                  region_name='region')
resource = session.resource('dynamodb')
table = resource.Table('table name')

def scan_func(last_key, counter, scanned):
    if last_key:
        result = table.scan(FilterExpression=F_EXP,
                            Select='COUNT',
                            ExclusiveStartKey=last_key)
    else:
        result = table.scan(FilterExpression=F_EXP,
                            Select='COUNT')
    counter += result['Count']
    scanned += result['ScannedCount']
    print('Current items found {} from {} scanned'.format(counter, scanned))
    # LastEvaluatedKey is absent once the scan has walked the whole table
    last_key = result.get('LastEvaluatedKey')
    if counter < TOTAL_ITEMS_TO_SCAN and last_key:
        scan_func(last_key, counter, scanned)
    else:
        print('total items found: {}, from {} scanned'.format(counter, scanned))

scan_func(None, 0, 0)
Even when running the scan with segments (parallel scan), after a few iterations on average I get the following response:
botocore.exceptions.ClientError: An error occurred (ProvisionedThroughputExceededException) when calling the Scan operation: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API
The best result I have managed so far is:
Current items found 16 from 3245 scanned
I also tried adding a 2-second sleep between iterations to give the database room to recover and free up provisioned capacity, but that did not work either. I also tried tripling the provisioned capacity to 3,000 instead of 1,000; it ran a few more iterations but eventually stopped as well.
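For reference, the sleep variant can be written as an iterative loop instead of recursion (the recursive version would also hit Python's recursion limit long before covering 10 million items), with Scan's `Limit` parameter capping how many items each page examines so a single request consumes less capacity. This is only a sketch; `page_limit` and `pause` are illustrative values, not tuned ones:

```python
import time

def throttled_count(table, f_exp, page_limit=100, pause=2.0):
    """Sequentially scan with a small page size and a pause between
    pages, so each page stays within the provisioned read capacity."""
    counter = scanned = 0
    kwargs = {'FilterExpression': f_exp, 'Select': 'COUNT', 'Limit': page_limit}
    while True:
        result = table.scan(**kwargs)
        counter += result['Count']
        scanned += result['ScannedCount']
        # LastEvaluatedKey is absent on the final page of the scan
        last_key = result.get('LastEvaluatedKey')
        if not last_key:
            break
        kwargs['ExclusiveStartKey'] = last_key
        time.sleep(pause)
    return counter, scanned
```

With `Limit=100` and 5–7 KB items, each page reads at most ~700 KB, well under 1,000 read capacity units even for strongly consistent reads, at the cost of a much longer total run time.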
Any ideas on how to make this work? Is there an alternative that does not involve increasing the table's read capacity?