Ever since switching to the newest thread/callback-based Python library, we have been seeing slowness between our Pub/Sub producers and consumers. We are relatively new to Google's Pub/Sub, and we're wondering whether others have run into similar issues since the recent library changes, or know of a setting we may have missed.
From the time we push a message to the time one of our 3 workers consumes it (in Python), we are seeing unexpected slowdowns. Our handler takes very little time to process each request, and we even changed the code to call message.ack() before running the handler, e.g.:
self.sub_client.subscribe(subscription_path, callback=self.message_callback)
The messages are not duplicates. When we enqueue them, we record the time in milliseconds so we can tell how long each message sat in the queue.
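To make that concrete, here is a minimal sketch of the callback shape (Worker and handle_request are illustrative names rather than our exact code; Timer, b64json_decode, and logger are the helpers shown further down):

import time

class Worker:  # illustrative container for the bound callback passed to subscribe()
    def message_callback(self, message):
        # Ack immediately, before any processing, so a slow handler can't delay the ack.
        ack_timer = Timer()
        message.ack()

        # The publisher embeds a 'tstamp' at enqueue time; the difference from now is
        # how long the message sat in the queue (roughly what the 'ACK start' logs below show).
        payload = b64json_decode(message.data)
        queued_secs = time.time() - float(payload.get('tstamp', 0))
        logger.info(f'ACK start: {ack_timer.get_msecs()} ms for '
                    f'{payload.get("tstamp")} ({queued_secs} secs)')

        self.handle_request(payload)  # the real (fast) handler runs after the ack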
for pod in fra-worker-staging-deployment-1003989621-2mx0n fra-worker-staging-deployment-1003989621-b6llt fra-worker-staging-deployment-1003989621-lx4gq; do echo == $pod ==; kubectl logs $pod -c fra-worker | grep 'ACK start'; done
== fra-worker-staging-deployment-1003989621-2mx0n ==
[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)
[2017-09-25 23:29:04,966] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.767 (0.19900012016296387 secs)
[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)
[2017-09-25 23:29:17,706] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.229 (10.476999998092651 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.782 (32.984999895095825 secs)
[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)
== fra-worker-staging-deployment-1003989621-b6llt ==
[2017-09-25 23:29:04,083] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.957 (0.12599992752075195 secs)
[2017-09-25 23:29:05,261] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.916 (0.3450000286102295 secs)
[2017-09-25 23:29:15,703] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.336 (11.367000102996826 secs)
[2017-09-25 23:29:25,630] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.812 (21.818000078201294 secs)
[2017-09-25 23:29:38,706] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.49 (34.21600008010864 secs)
[2017-09-25 23:30:01,752] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382146.696 (55.055999994277954 secs)
== fra-worker-staging-deployment-1003989621-lx4gq ==
[2017-09-25 23:29:03,342] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382142.889 (0.4530000686645508 secs)
[2017-09-25 23:29:04,955] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.907 (1.0469999313354492 secs)
[2017-09-25 23:29:14,704] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382143.888 (10.815999984741211 secs)
[2017-09-25 23:29:17,705] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.205 (10.5 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.197 (33.5699999332428 secs)
[2017-09-25 23:29:59,733] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.269 (55.46399998664856 secs)
[2017-09-25 23:31:18,870] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382146.924 (131.94599986076355 secs)
Initially it looks like messages take very little time from being enqueued to being read, but then they start to back up, and after that there appear to be delays of 10 s, 32 s, 55 s. (And these are not duplicates, so it isn't retry logic kicking in because of failed acks.)
We wrote a small test that runs quickly with a small number of senders and messages, but once we bump the message count up to 1500 and the number of senders to 3, we see the futures returned by the publish() call frequently resolve to an exception: PublishError('Some messages were not successfully published.'). The runs show roughly 500 messages/sec, but with an error rate above 10%.
Even though our senders finish in about 3 seconds (they run in parallel), the workers are receiving messages that were enqueued 20+ seconds earlier.
Done in 2929 ms, 512.12 qps (154 10.3%)
Done in 2901 ms, 517.06 qps (165 11.0%)
Done in 2940 ms, 510.20 qps (217 14.5%)
And here is what the worker/listener sees:
Got message {'tstamp': '1506557436.988', 'msg': 'msg#393@982'} 20.289 sec
Here is the worker/listener code:
import time
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def receive(message):
    decoded = b64json_decode(message.data)
    message.ack()
    took = time.time() - float(decoded.get('tstamp', 0))
    print(f'Got message {decoded} {took:0.3f} sec')

if __name__ == '__main__':
    client = pubsub_v1.SubscriberClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')
    try:
        client.create_subscription(subs_path, topic_path)
    except Exception:
        pass  # ignore errors, e.g. if the subscription already exists
    print(f'Subscription: topic={topic_path} subscription={subs_path}')
    timer = Timer()
    client.subscribe(subs_path, callback=receive)
    time.sleep(120)
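For completeness, we understand subscribe() can also take flow-control settings; a minimal sketch of passing them, reusing the listener's constants above (we are assuming the flow_control keyword and types.FlowControl are available in our library version, and the max_messages value is arbitrary):

from google.cloud import pubsub_v1

client = pubsub_v1.SubscriberClient()
subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')

# Assumption: FlowControl caps how many messages the client leases/holds locally
# at a time; a lower cap might spread messages more evenly across the 3 workers.
flow_control = pubsub_v1.types.FlowControl(max_messages=50)
client.subscribe(subs_path, callback=receive, flow_control=flow_control)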
And here is the sender/publisher code:
import os
import time
import concurrent.futures
from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc
from core.utils import b64json_decode, b64json_encode, Timer
TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='
def publish(topic_path, message, client):
    tstamp = f'{time.time():0.3f}'
    data = {'tstamp': tstamp, 'msg': message}
    future = client.publish(topic_path, b64json_encode(data, raw=True))
    future.add_done_callback(lambda x: print(f'Publishing done callback: {data}'))
    return future

if __name__ == '__main__':
    client = pubsub_v1.PublisherClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    num = 1500
    pid = os.getpid()
    fs = []
    timer = Timer()
    for i in range(0, num):
        f = publish(topic_path, f'msg#{i}@{pid}', client)
        fs.append(f)
    print(f'Launched {len(fs)} futures in {timer.get_msecs()} ms')
    good = bad = 0
    for future in fs:
        try:
            data = future.result()
            # print(f'result: {data}')
            good += 1
        except Exception as exc:
            print(f'generated an exception: {exc} ({exc!r})')
            bad += 1
    took_ms = timer.get_msecs()
    pct = bad / num
    print(f'Done in {took_ms} ms, {num / took_ms * 1000:0.2f} qps ({bad} {pct:0.1%})')
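On the publish side, a related knob is the publisher's batch settings, since the PublishError above appears to be raised per failed batch; a minimal sketch of tuning them (we are assuming pubsub_v1.types.BatchSettings and this constructor argument exist in our version, and the numbers are arbitrary):

from google.cloud import pubsub_v1

# Assumption: the publisher buffers messages and sends a batch when any of these
# limits is reached; smaller batches would mean more, smaller publish RPCs.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=100,       # max messages per batch
    max_bytes=1024 * 1024,  # max batch size in bytes
    max_latency=0.05,       # max seconds to wait before sending a partial batch
)
client = pubsub_v1.PublisherClient(batch_settings)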
Here is the Timer class from our core.utils:
# imports needed by this excerpt of core.utils
import sys
import time

import arrow

####################
# Time / Timing
####################

def utcnow():
    """Time now with tzinfo, mainly for mocking in unittests"""
    return arrow.utcnow()

def relative_time():
    """Relative time for finding timedeltas depending on your python version"""
    if sys.version_info[0] >= 3:
        return time.perf_counter()
    else:
        return time.time()

class Timer:
    def __init__(self):
        self.reset()

    def reset(self):
        self.start_time = relative_time()

    def get_msecs(self):
        return int((relative_time() - self.start_time) * 1000)

    def get_secs(self):
        return int(relative_time() - self.start_time)
Also, in our main code, we occasionally see IOErrors from which the thread never seems to recover (apart from DEADLINE_EXCEEDED, which we can ignore). To work around this, we subclassed the policy so we can catch some of these exceptions and restart the client as needed (though we're not sure it's working well):
class OurPolicy(thread.Policy):
    _exception_caught = None

    def __init__(self, *args, **kws):
        logger.info(f'Initializing our PubSub Policy Wrapper')  # noqa
        return super(OurPolicy, self).__init__(*args, **kws)

    def on_exception(self, exc):
        # If this is DEADLINE_EXCEEDED, then we want to retry by returning None instead of raise-ing
        deadline_exceeded = grpc.StatusCode.DEADLINE_EXCEEDED
        code_value = getattr(exc, 'code', lambda: None)()
        logger.error(f'Caught Exception in PubSub Policy Wrapper: code={code_value} exc={exc}')
        if code_value == deadline_exceeded:
            return
        OurPolicy._exception_caught = exc
        # will just raise exc
        return super(OurPolicy, self).on_exception(exc)
[...later...]
while True:
    time.sleep(1)
    if OurPolicy._exception_caught:
        exc = OurPolicy._exception_caught
        OurPolicy._exception_caught = None
        raise exc
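To make the "restart the client" part concrete, this is roughly the shape of the outer loop we have in mind; wait_for_policy_exception is just a name for the polling loop shown above, and we have left out how OurPolicy actually gets attached to the SubscriberClient because we are not sure of the right way to do that:

def wait_for_policy_exception():
    # Same polling loop as above: re-raise whatever OurPolicy.on_exception recorded.
    while True:
        time.sleep(1)
        if OurPolicy._exception_caught:
            exc = OurPolicy._exception_caught
            OurPolicy._exception_caught = None
            raise exc

while True:
    # Recreate the client and subscription, then block until the policy reports a failure.
    client = pubsub_v1.SubscriberClient()
    client.subscribe(subs_path, callback=receive)
    try:
        wait_for_policy_exception()
    except Exception as exc:
        logger.error(f'Restarting subscriber after error: {exc!r}')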