Google Pub/Sub message processing is slow (Python)

Time: 2017-09-28 00:44:01

Tags: python google-kubernetes-engine grpc google-cloud-pubsub concurrent.futures

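Since switching to the latest Python libraries, which are thread/callback based, we have been seeing slowness between our Pub/Sub producers and consumers. We are relatively new to Google's Pub/Sub, and we are wondering whether others have run into similar issues after the recent library changes, or know of settings we may have missed.

We are seeing an unexpected slowdown between the time a message is pushed and the time it is consumed by one of our 3 workers (in Python). Our handlers take very little time to process each request, and we also changed our code to call message.ack() before running the handler. We subscribe with, e.g., self.sub_client.subscribe(subscription_path, callback=self.message_callback). The messages are not duplicates. When we enqueue them we record the time in msecs so that we can see how long they have been in the queue.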
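A minimal sketch of the ack-first pattern with an enqueue timestamp, as described above. It is an illustration only: `make_callback`, `queue_delay_secs` and the plain-JSON payload with a `tstamp` field are assumptions made for this example and stand in for the real helpers, which are not shown in full here.

```python
import json
import time


def queue_delay_secs(enqueued_tstamp, now=None):
    """Seconds elapsed between the enqueue timestamp and `now`."""
    if now is None:
        now = time.time()
    return now - float(enqueued_tstamp)


def make_callback(handler, log=print):
    """Wrap `handler` so that the message is acked before the handler runs."""
    def callback(message):
        # Ack first, so that a slow handler cannot delay the ack.
        message.ack()
        # Assumed payload format: UTF-8 JSON carrying a 'tstamp' field.
        payload = json.loads(message.data.decode('utf-8'))
        delay = queue_delay_secs(payload['tstamp'])
        log(f'ACK start: queued for {delay:0.3f} secs')
        handler(payload)
    return callback
```

The returned function only relies on `message.ack()` and `message.data`, so it can be exercised with a stand-in message object before being passed as the `callback` argument of `subscribe`.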

for pod in worker-staging-deployment-1003989621-2mx0n worker-staging-deployment-1003989621-b6llt worker-staging-deployment-1003989621-lx4gq; do echo == $pod ==; kubectl logs $pod -c fra-worker | grep 'ACK start'; done
== fra-worker-staging-deployment-1003989621-2mx0n ==
[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)
[2017-09-25 23:29:04,966] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.767 (0.19900012016296387 secs)
[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)
[2017-09-25 23:29:17,706] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.229 (10.476999998092651 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.782 (32.984999895095825 secs)
[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)
== fra-worker-staging-deployment-1003989621-b6llt ==
[2017-09-25 23:29:04,083] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.957 (0.12599992752075195 secs)
[2017-09-25 23:29:05,261] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382144.916 (0.3450000286102295 secs)
[2017-09-25 23:29:15,703] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.336 (11.367000102996826 secs)
[2017-09-25 23:29:25,630] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.812 (21.818000078201294 secs)
[2017-09-25 23:29:38,706] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.49 (34.21600008010864 secs)
[2017-09-25 23:30:01,752] {pubsub.py:147} INFO - ACK start: 3 ms for 1506382146.696 (55.055999994277954 secs)
== fra-worker-staging-deployment-1003989621-lx4gq ==
[2017-09-25 23:29:03,342] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382142.889 (0.4530000686645508 secs)
[2017-09-25 23:29:04,955] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382143.907 (1.0469999313354492 secs)
[2017-09-25 23:29:14,704] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382143.888 (10.815999984741211 secs)
[2017-09-25 23:29:17,705] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382147.205 (10.5 secs)
[2017-09-25 23:29:37,767] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.197 (33.5699999332428 secs)
[2017-09-25 23:29:59,733] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382144.269 (55.46399998664856 secs)
[2017-09-25 23:31:18,870] {pubsub.py:147} INFO - ACK start: 1 ms for 1506382146.924 (131.94599986076355 secs)

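Initially, messages appear to take only a short time from being queued to being read, but then they start arriving later and later, as if there were delays of 10, 32 and 55 seconds. (Again, these are not duplicates, so this is not retry logic triggered by a failed ack.)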
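To make the growing delay easier to see, the 'ACK start' lines can be parsed and summarized. The following is a rough sketch, assuming the log format shown above; `ACK_LINE`, `parse_delays` and `SAMPLE` are names invented for this example, and the sample lines are copied from the first pod's output.

```python
import re

# Matches e.g. "ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)"
ACK_LINE = re.compile(
    r'ACK start: (?P<ms>\d+) ms for (?P<tstamp>[\d.]+) \((?P<delay>[\d.]+) secs\)'
)


def parse_delays(lines):
    """Return (enqueue_tstamp, delay_secs) pairs for every 'ACK start' line."""
    rows = []
    for line in lines:
        match = ACK_LINE.search(line)
        if match:
            rows.append((float(match.group('tstamp')), float(match.group('delay'))))
    return rows


SAMPLE = [
    '[2017-09-25 23:29:03,987] {pubsub.py:147} INFO - ACK start: 22 ms for 1506382143.88 (0.10699987411499023 secs)',
    '[2017-09-25 23:29:14,708] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382144.219 (10.488999843597412 secs)',
    '[2017-09-25 23:30:00,649] {pubsub.py:147} INFO - ACK start: 2 ms for 1506382146.257 (54.39199995994568 secs)',
]

delays = [delay for _, delay in parse_delays(SAMPLE)]
print(f'max delay: {max(delays):0.3f} secs')  # prints "max delay: 54.392 secs"
```

Sorting the parsed rows by enqueue timestamp shows that messages enqueued within a few seconds of each other are delivered anywhere from well under a second to almost a minute later.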

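We wrote a small test that runs quickly with a small number of senders and messages, but once we raise the message count to 1500 and the number of senders to 3, the publish call often returns a future whose result is the exception PublishError('Some messages were not successfully published.'). The results show about 500 messages/sec, but with more than 10% of the publish() calls raising this exception:

Done in 2929 ms, 512.12 qps (154 10.3%)
Done in 2901 ms, 517.06 qps (165 11.0%)
Done in 2940 ms, 510.20 qps (217 14.5%)

Although our senders finish in about 3 seconds (they run in parallel), the workers are receiving messages that were enqueued 20 seconds earlier:

Got message {'tstamp': '1506557436.988', 'msg': 'msg#393@982'} 20.289 sec

Here is the worker/listener: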

import time

from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc

from core.utils import b64json_decode, b64json_encode, Timer


TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='


def receive(message):
    decoded = b64json_decode(message.data)
    message.ack()
    took = time.time() - float(decoded.get('tstamp', 0))
    print(f'Got message {decoded} {took:0.3f} sec')


if __name__ == '__main__':
    client = pubsub_v1.SubscriberClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)
    subs_path = client.subscription_path(NOTIFY_PROJECT, 'pubsub-worker')

    try:
        # Create the subscription; any error (e.g. it already exists) is ignored.
        client.create_subscription(name=subs_path, topic=topic_path)
    except Exception:
        pass
    print(f'Subscription: topic={topic_path} subscription={subs_path}')

    timer = Timer()
    client.subscribe(subs_path, callback=receive)
    time.sleep(120)

And the sender/publisher:

import os
import time
import concurrent.futures

from google.api.core.exceptions import RetryError as core_RetryError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread
from google.gax.errors import RetryError as gax_RetryError
import grpc

from core.utils import b64json_decode, b64json_encode, Timer

TOPIC = 'pubsub-speed-test'
NOTIFY_PROJECT = '== OUR PROJECT =='


def publish(topic_path, message, client):
    tstamp = f'{time.time():0.3f}'
    data = {'tstamp': tstamp, 'msg': message}
    future = client.publish(topic_path, b64json_encode(data, raw=True))
    future.add_done_callback(lambda x: print(f'Publishing done callback: {data}'))
    return future


if __name__ == '__main__':
    client = pubsub_v1.PublisherClient()
    topic_path = client.topic_path(NOTIFY_PROJECT, TOPIC)

    num = 1500
    pid = os.getpid()
    fs = []
    timer = Timer()
    for i in range(0, num):
        f = publish(topic_path, f'msg#{i}@{pid}', client)
        fs.append(f)
    print(f'Launched {len(fs)} futures in {timer.get_msecs()} ms')

    good = bad = 0
    for future in fs:
        try:
            data = future.result()
            # print(f'result: {data}')
            good += 1
        except Exception as exc:
            print(f'generated an exception: {exc} ({exc!r})')
            bad += 1
    took_ms = timer.get_msecs()
    pct = bad / num
    print(f'Done in {took_ms} ms, {num / took_ms * 1000:0.2f} qps ({bad} {pct:0.1%})')

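Additionally, in our main code we occasionally see IOErrors from which the thread does not seem to recover (on top of DEADLINE_EXCEEDED, which can be ignored). To work around this we wrapped the policy so that we can catch some exceptions and restart the client as needed (although we are not sure it works well). That wrapper is shown further down; first, here is the Timer class from our core.utils: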

####################
# Time / Timing
####################

import sys
import time

import arrow


def utcnow():
    """Time now with tzinfo, mainly for mocking in unittests"""
    return arrow.utcnow()


def relative_time():
    """Relative time for finding timedeltas depending on your python version"""
    if sys.version_info[0] >= 3:
        return time.perf_counter()
    else:
        return time.time()


class Timer:
    def __init__(self):
        self.reset()

    def reset(self):
        self.start_time = relative_time()

    def get_msecs(self):
        return int((relative_time() - self.start_time) * 1000)

    def get_secs(self):
        return int((relative_time() - self.start_time))

Our version of the policy:

import logging

import grpc
from google.cloud.pubsub_v1.subscriber.policy import thread

# Assumed setup: the excerpt does not show how `logger` is created.
logger = logging.getLogger(__name__)


class OurPolicy(thread.Policy):
    # Last non-ignorable exception seen; polled and re-raised by the main loop.
    _exception_caught = None

    def __init__(self, *args, **kws):
        logger.info('Initializing our PubSub Policy Wrapper')
        super(OurPolicy, self).__init__(*args, **kws)

    def on_exception(self, exc):
        # If this is DEADLINE_EXCEEDED, then we want to retry by returning None instead of raise-ing
        deadline_exceeded = grpc.StatusCode.DEADLINE_EXCEEDED
        # gRPC errors expose code(); other exceptions fall back to None.
        code_value = getattr(exc, 'code', lambda: None)()
        logger.error(f'Caught Exception in PubSub Policy Wrapper: code={code_value} exc={exc}')
        if code_value == deadline_exceeded:
            return
        OurPolicy._exception_caught = exc
        # will just raise exc
        return super(OurPolicy, self).on_exception(exc)

[...later...]

                while True:
                    time.sleep(1)
                    if OurPolicy._exception_caught:
                        exc = OurPolicy._exception_caught
                        OurPolicy._exception_caught = None
                        raise exc
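The "restart the client as needed" part is not shown in the excerpt. One way it could be structured is sketched below, under stated assumptions: `run_with_restarts`, `start_client`, `poll_error` and `max_restarts` are hypothetical names for this example, and `start_client` stands for whatever creates the subscriber and calls `subscribe`.

```python
import time


def run_with_restarts(start_client, poll_error, max_restarts=3, poll_interval=1.0):
    """Start a client, poll for a recorded exception, and restart on failure.

    `start_client` builds and starts a new client; `poll_error` returns the
    most recently recorded exception, or None when nothing went wrong.
    Gives up and re-raises after `max_restarts` restarts.
    """
    restarts = 0
    while True:
        client = start_client()
        try:
            while True:
                time.sleep(poll_interval)
                exc = poll_error()
                if exc is not None:
                    raise exc
        except Exception:
            # Best-effort cleanup; the client may not offer close() at all.
            close = getattr(client, 'close', None)
            if callable(close):
                close()
            restarts += 1
            if restarts > max_restarts:
                raise
```

With the wrapper above, `poll_error` could read and clear `OurPolicy._exception_caught`. The cap on restarts keeps a permanently failing client from looping forever.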

0 answers:

No answers yet