Race condition in a ZooKeeper- and Python-based message queue

Time: 2013-05-03 07:58:37

Tags: python locking message-queue race-condition apache-zookeeper

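I've been evaluating ZooKeeper as a simple message queue and wrote two very simple scripts: an mq feeder and an mq consumer. The feeder below pushes 20 jobs into the queue and then monitors the queue state (the jobs being consumed):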

from kazoo.client import KazooClient

zk = KazooClient(hosts='xxx')
zk.start()

for i in xrange(20):
  zk.create("/queue/%s" % i, b"%s" % i)

while 1:
  print zk.get_children('/queue')

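The consumer below is started several times (up to 3 concurrent processes in my tests). It fetches the job list, iterates over it to find an unlocked job, processes it (sleeping a random number of seconds to simulate some work), deletes the job, and then deletes the lock: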

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError
from time import sleep
import random

zk = KazooClient(hosts='xxx')
zk.start()
zk.ensure_path("/locks")
zk.ensure_path("/queue")

while 1:
  jobs = sorted(zk.get_children('/queue'))
  if jobs:
    for i in jobs:
      print "Checking job: %s" % i
      try:
        zk.create("/locks/%s" % i)
      except NodeExistsError:
        print "Job is locked, skipping!"
        pass
      else:
        print "Job is unlocked, processing."
        sleep(random.randrange(5))
        zk.delete("/queue/%s" % i)
        print "Deleted processed job, deleting the lock."
        zk.delete("/locks/%s" % i)
        pass
  else:
    print "There's no locks in the queue."
    pass

The problem I'm seeing, and can't track down, is that consumer processes exit with:

Traceback (most recent call last):
  File "zk_consumer.py", line 24, in <module>
    zk.delete("/queue/%s" % i)
  File "/Library/Python/2.7/site-packages/kazoo/client.py", line 1055, in delete
    return self.delete_async(path, version).get()
  File "/Library/Python/2.7/site-packages/kazoo/handlers/threading.py", line 107, in get
    raise self._exception
kazoo.exceptions.NoNodeError: ((), {})

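Meanwhile, the last remaining process keeps checking forever a single job that stays in the queue but is always locked. Clearly I have some logic error here that I believe leads to a race condition, but I've spent some time on it and can't spot it. What am I doing wrong here, or is ZooKeeper not a viable solution for a simple job queue?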

1 answer:

Answer 0 (score: 1)

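Your code is racy. Consider this sequence, where T1 and T2 are two consumers and T1 is working from a job list it fetched before T2 finished job 1: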

T1                      T2
read queue/1     
                        read queue/1
                        write lock/1
                        delete queue/1
                        delete lock/1
write lock/1 
delete queue/1 (FAIL, no node!)

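After taking the lock, you need to read again to make sure nobody else has already deleted queue/1. The job list from get_children is only a snapshot, so by the time T1 creates its lock the job may have been processed and its lock released.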
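A minimal sketch of that re-check, assuming the same /queue and /locks layout as in the question. The helper name try_process and the work callback are hypothetical, introduced only for illustration; create, exists and delete are the regular KazooClient calls. The try/finally is an addition beyond the fix described above: it releases the lock even if processing raises, so a failed job does not stay locked. The import fallback only lets the sketch load where kazoo is not installed.

```python
try:
    from kazoo.exceptions import NodeExistsError
except ImportError:
    # Fallback (assumption): lets the sketch load without kazoo installed.
    class NodeExistsError(Exception):
        pass


def try_process(zk, job, work):
    """Process one job if it is unlocked and still queued.

    Returns True if this consumer processed the job, False otherwise.
    `zk` is a started KazooClient; `work` is a hypothetical callable
    that does the actual processing.
    """
    lock_path = "/locks/%s" % job
    queue_path = "/queue/%s" % job
    try:
        zk.create(lock_path)
    except NodeExistsError:
        # Another consumer currently holds the lock for this job.
        return False
    try:
        # Re-check under the lock: the job list may be stale, and another
        # consumer may have finished this job before we locked it.
        if zk.exists(queue_path) is None:
            return False
        work(job)
        zk.delete(queue_path)
        return True
    finally:
        # Always release the lock, even if work() raised.
        zk.delete(lock_path)
```

In the consumer loop, the body of the for loop would then shrink to a single call such as try_process(zk, i, work). Because a consumer only deletes a queue node while holding the matching lock, the exists check inside the lock is enough to avoid the NoNodeError from the traceback.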