My application has run into a bottleneck and I'm having a hard time finding a solution. A bit of background:
The current implementation:
There are other checks in place to ensure that all pull-queue tasks are handled correctly and that all items get downloaded.
The problem:
We want to download and store all the items and aggregates as quickly as possible. I have 20 instances enabled for each of the backend configurations described (I'll refer to them as the "aggregator" backend and the "downloader" backend). The downloader backend seems to get through the API calls fairly quickly. I make heavy use of the NDB library and asynchronous URL Fetch / Datastore calls to achieve this. I also enabled threadsafe: true so that no instance sits waiting for RPC calls to finish before starting the next task (all tasks can run independently and are idempotent).
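For illustration, the fetch-and-store pattern on the downloader backend looks roughly like the sketch below (the DownloadedItem model and the function name are simplified stand-ins, not the actual code):

    from google.appengine.ext import ndb

    class DownloadedItem(ndb.Model):  # simplified stand-in for the real item model
        payload = ndb.BlobProperty()

    @ndb.tasklet
    def fetch_and_store_async(item_url):
        ctx = ndb.get_context()
        result = yield ctx.urlfetch(item_url)   # async URL Fetch; other tasklets run meanwhile
        item = DownloadedItem(id=item_url, payload=result.content)
        yield item.put_async()                  # async Datastore write, batched by NDB
        raise ndb.Return(item)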
The aggregator backend is where the big time sink comes into play. Storing 500-1500 of these aggregates asynchronously via transactions takes 40+ seconds (and I don't even think all of the transactions are being committed properly). I keep this backend at threadsafe: false because I use a pull-queue lease deadline of 300 seconds; if I allowed multiple tasks to execute on a single instance, they could cascade and push some tasks past the 300-second mark, allowing another instance to lease the same task a second time and possibly double-count.
The logs are showing BadRequestError: Nested transactions are not supported.
preceded (in the stack trace) by TransactionFailedError: too much contention on these datastore entities. please try again.
Another error I see regularly is BadRequestError(The referenced transaction has expired or is no longer valid.)
From my understanding, these errors sometimes mean the transaction can still be committed without further interaction. How do I know whether it was committed correctly? Am I doing this in a logical/efficient way, or is there room for more concurrency without the risk of making a mess of everything?
Relevant code:
from google.appengine.ext import ndb
import random

class GeneralShardConfig(ndb.Model):
    """Tracks the number of shards for each named counter."""
    name = ndb.StringProperty(required=True)
    num_shards = ndb.IntegerProperty(default=4)

class GeneralAggregateShard(ndb.Model):
    """Shards for each named counter"""
    name = ndb.StringProperty(name='n', required=True)
    count = ndb.FloatProperty(name='c', default=0.00)  # acts as a total now

@ndb.tasklet
def increment_batch(data_set):
    def run_txn(name, value):
        @ndb.tasklet
        def txn():
            to_put = []
            dbkey = ndb.Key(GeneralShardConfig, name)
            config = yield dbkey.get_async(use_memcache=False)
            if not config:
                config = GeneralShardConfig(key=dbkey, name=name)
                to_put.append(config)
            index = random.randint(0, config.num_shards-1)
            shard_name = name + str(index)
            dbkey = ndb.Key(GeneralAggregateShard, shard_name)
            counter = yield dbkey.get_async()
            if not counter:
                counter = GeneralAggregateShard(key=dbkey, name=name)
            counter.count += value
            to_put.append(counter)
            yield ndb.put_multi_async(to_put)
        return ndb.transaction_async(txn, use_memcache=False, xg=True)
    res = yield [run_txn(key, value) for key, value in data_set.iteritems() if value != 0.00]
    raise ndb.Return(res)
Given this implementation, the only "room for contention" I can see is if 2 or more aggregation tasks need to update the same aggregate name, which shouldn't happen too often; with the sharded counters I'd expect that overlap to occur rarely, if ever. I assume the
BadRequestError(The referenced transaction has expired or is no longer valid.)
error shows up when the event loop checks the status of all the tasklets and hits a reference to a transaction that has already completed. The question is what the error output implies here: does it mean all the transactions were cut off prematurely, or can I assume they all went through? I further assume the line
res = yield [run_txn(key, value) for key, value in data_set.iteritems() if value != 0.00]
needs to be broken into a try/except for each tasklet in order to detect these errors.
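For example, a minimal sketch of that per-tasklet handling (replacing the last two lines of increment_batch; it assumes `from google.appengine.api import datastore_errors` and `logging` are available, and the treatment of failures is only a placeholder) could be:

    futures = dict((key, run_txn(key, value))
                   for key, value in data_set.iteritems() if value != 0.00)
    failed = {}
    for name, fut in futures.iteritems():
        try:
            yield fut
        except (datastore_errors.TransactionFailedError,
                datastore_errors.BadRequestError) as e:
            # Record which aggregates failed so they can be retried or re-queued.
            logging.warning('Aggregate %s failed: %s', name, e)
            failed[name] = data_set[name]
    raise ndb.Return(failed)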
Before I go crazy over this, I'd appreciate any guidance/help on how to optimize this process and do it in a reliable way.
EDIT 1: I modified the aggregator task behavior as follows:
This has helped cut down on the contention errors I was seeing, but it's still not very reliable. Most recently, I hit BadRequestError: Nested transactions are not supported. along with:
RuntimeError: Deadlock waiting for <Future fbf0db50 created by transaction_async(model.py:3345) for tasklet transaction(context.py:806) suspended generator transaction(context.py:876); pending>
I believe this modification should optimize the process by letting all possible overlaps in the aggregation be combined and attempted at once within a single instance, rather than having multiple instances all executing transactions that can clash. I am still having trouble saving the results in a reliable way.
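As a rough sketch of that idea, the leased tasks' payloads get merged per aggregate name in memory before any Datastore work is attempted (leased_tasks and decode_payload are stand-ins here; the real lease/decode code is not shown):

    from collections import defaultdict

    combined = defaultdict(float)
    for task in leased_tasks:                                  # tasks leased from the pull queue
        for name, value in decode_payload(task).iteritems():  # stand-in for payload decoding
            combined[name] += value
    # One transaction per aggregate name instead of one per leased task:
    increment_batch(dict(combined)).get_result()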
Answer 0 (score: 5):
By cutting down on Datastore I/O (leaving the work to the autobatchers and disabling indexing), you can be more confident that the Datastore writes complete (less contention), and it should be faster.
The config (renamed Counter) get is done outside the transaction and can run concurrently while the transactions loop.
Added methods and a total property to Counter to (hopefully) make it easier to modify in the future.
Created a new ndb Property for decimal support (assuming that's why you specified 0.00 instead of 0.0).
Edit:
Removed the need for transactions and changed the sharding system for reliability.
import webapp2

import copy
import decimal
import logging
import random
import string

from google.appengine.api import datastore_errors
from google.appengine.datastore import entity_pb
from google.appengine.ext import deferred
from google.appengine.ext import ndb

TEST_BATCH_SIZE = 250
TEST_NAME_LEN = 12

class DecimalProperty(ndb.Property):
    """A Property whose value is a decimal.Decimal object."""

    def _datastore_type(self, value):
        return str(value)

    def _validate(self, value):
        if not isinstance(value, decimal.Decimal):
            raise datastore_errors.BadValueError('Expected decimal.Decimal, got %r'
                                                 % (value,))
        return value

    def _db_set_value(self, v, p, value):
        value = str(value)
        v.set_stringvalue(value)
        if not self._indexed:
            p.set_meaning(entity_pb.Property.TEXT)

    def _db_get_value(self, v, _):
        if not v.has_stringvalue():
            return None
        value = v.stringvalue()
        return decimal.Decimal(value)

class BatchInProgress(ndb.Model):
    """Use a scheduler to delete batches in progress after a certain time"""

    started = ndb.DateTimeProperty(auto_now=True)

    def clean_up(self):
        qry = Shard.query().filter(Shard.batch_key == self.key)
        keys = qry.fetch(keys_only=True)
        while keys:
            ndb.delete_multi(keys)
            keys = qry.fetch(keys_only=True)

def cleanup_failed_batch(batch_key):
    batch = batch_key.get()
    if batch:
        batch.clean_up()
        batch.delete()

class Shard(ndb.Model):
    """Shards for each named counter"""

    counter_key = ndb.KeyProperty(name='c')
    batch_key = ndb.KeyProperty(name='b')
    count = DecimalProperty(name='v', default=decimal.Decimal('0.00'),
                            indexed=False)

class Counter(ndb.Model):
    """Tracks the number of shards for each named counter"""

    @property
    def shards(self):
        qry = Shard.query().filter(Shard.counter_key == self.key)
        results = qry.fetch(use_cache=False, use_memcache=False)
        return filter(None, results)

    @property
    def total(self):
        count = decimal.Decimal('0.00')  # Use initial value if no shards
        for shard in self.shards:
            count += shard.count
        return count

    @ndb.tasklet
    def incr_async(self, value, batch_key):
        index = batch_key.id()
        name = self.key.id() + str(index)
        shard = Shard(id=name, count=value,
                      counter_key=self.key, batch_key=batch_key)
        yield shard.put_async(use_cache=False, use_memcache=False)

    def incr(self, *args, **kwargs):
        return self.incr_async(*args, **kwargs).get_result()

@ndb.tasklet
def increment_batch(data_set):
    batch_key = yield BatchInProgress().put_async()
    deferred.defer(cleanup_failed_batch, batch_key, _countdown=3600)

    # NOTE: mapping is modified in place, hence copying
    mapping = copy.copy(data_set)

    # (1/3) filter and fire off counter gets
    #       so the futures can autobatch
    counters = {}
    ctr_futs = {}
    ctr_put_futs = []
    zero_values = set()
    for name, value in mapping.iteritems():
        if value != decimal.Decimal('0.00'):
            ctr_fut = Counter.get_by_id_async(name)  # Use cache(s)
            ctr_futs[name] = ctr_fut
        else:
            # Skip zero values because...
            zero_values.add(name)
            continue

    for name in zero_values:
        del mapping[name]  # Remove all zero values from the mapping
    del zero_values

    while mapping:  # Repeat until all transactions succeed

        # (2/3) wait on counter gets and fire off increment transactions
        #       this way autobatchers should fill time
        incr_futs = {}
        for name, value in mapping.iteritems():
            counter = counters.get(name)
            if not counter:
                counter = counters[name] = yield ctr_futs.pop(name)
            if not counter:
                logging.info('Creating new counter %s', name)
                counter = counters[name] = Counter(id=name)
                ctr_put_futs.append(counter.put_async())
            else:
                logging.debug('Reusing counter %s', name)
            incr_fut = counter.incr_async(value, batch_key)
            incr_futs[(name, value)] = incr_fut

        # (3/3) wait on increments and handle errors
        #       by using a tuple key for variable access
        for (name, value), incr_fut in incr_futs.iteritems():
            counter = counters[name]
            try:
                yield incr_fut
            except:
                pass
            else:
                del mapping[name]

        if mapping:
            logging.warning('%i increments failed this batch.' % len(mapping))

    yield batch_key.delete_async(), ctr_put_futs

    raise ndb.Return(counters.values())

class ShardTestHandler(webapp2.RequestHandler):

    @ndb.synctasklet
    def get(self):
        if self.request.GET.get('delete'):
            ndb.delete_multi_async(Shard.query().fetch(keys_only=True))
            ndb.delete_multi_async(Counter.query().fetch(keys_only=True))
            ndb.delete_multi_async(BatchInProgress.query().fetch(keys_only=True))
        else:
            data_set_test = {}
            for _ in xrange(TEST_BATCH_SIZE):
                name = ''
                for _ in xrange(TEST_NAME_LEN):
                    name += random.choice(string.letters)
                value = decimal.Decimal('{0:.2f}'.format(random.random() * 100))
                data_set_test[name] = value
            yield increment_batch(data_set_test)

        self.response.out.write("Done!")

app = webapp2.WSGIApplication([('/shard_test/', ShardTestHandler)], debug=True)
app = ndb.toplevel(app.__call__)
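For reference, reading a total back after a batch has run would look something like this (the counter name here is just an example):

    counter = Counter.get_by_id('example-counter-name')
    if counter:
        logging.info('Total: %s', counter.total)  # sums the counts of all live shards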
Answer 1 (score: 5):
On the subject of the "The referenced transaction has expired or is no longer valid" BadRequestError in particular, it's a little-advertised fact that transactions time out much faster than requests do. You get 15 seconds of life for free from creation; after that, the transaction is killed if it stays idle for 15 consecutive seconds (so an effective minimum lifespan of 30 seconds), and it is killed regardless after 60 seconds. That makes it hard to run a large number of transactions in parallel, because CPU contention and an unfair tasklet scheduling algorithm can leave some transactions idle for too long.
The following monkeypatch of ndb's transaction method helps by retrying expired transactions, but ultimately you have to tune your batching to bring contention down to a manageable level.
import logging
import random
import time

from google.appengine.api import datastore_errors
from google.appengine.ext import ndb

# Not defined in the original snippet: the retry limit here is an assumed value.
_MAX_BAD_REQUEST_RECOVERY_ATTEMPTS = 3

_ndb_context_transaction = ndb.Context.transaction

@ndb.tasklet
def _patched_transaction(self, callback, **ctx_options):
    # Nested/independent transactions pass straight through to the original method.
    if (self.in_transaction() and
        ctx_options.get('propagation') != ndb.TransactionOptions.INDEPENDENT):
        raise ndb.Return((yield _ndb_context_transaction(self, callback, **ctx_options)))

    attempts = 1
    start_time = time.time()
    me = random.getrandbits(16)
    logging.debug('Transaction started <%04x>', me)
    while True:
        try:
            result = yield _ndb_context_transaction(self, callback, **ctx_options)
        except datastore_errors.BadRequestError as e:
            # Only retry the "expired" flavour of BadRequestError, up to the limit.
            if not ('expired' in str(e) and
                    attempts < _MAX_BAD_REQUEST_RECOVERY_ATTEMPTS):
                raise
            logging.warning(
                'Transaction retrying <%04x> (attempt #%d, %.1f seconds) on BadRequestError: %s',
                me, attempts, time.time() - start_time, e)
            attempts += 1
        else:
            logging.debug(
                'Transaction finished <%04x> (attempt #%d, %.1f seconds)',
                me, attempts, time.time() - start_time)
            raise ndb.Return(result)

ndb.Context.transaction = _patched_transaction