Question

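I am writing a producer/consumer to meet a need at work.

Typically there is one producer thread that fetches logs from a remote server and puts them into a queue, and one or more consumer threads that read from the queue and do some work. Afterwards, the data and the results need to be saved (for example in a sqlite3 db) for later analysis.

To make sure each log is processed only once, I have to query the database before consuming each piece of data to see whether it has already been handled. I wonder whether there is a better way to achieve this. With multiple consumer threads, database locking seems to be a problem.

Relevant code: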

import queue  # named "Queue" on Python 2
import threading
import time  # needed for time.sleep() in the producer

import requests

out_queue = queue.Queue()


class ProducerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # Read remote log and put chunk in out_queue
            resp = requests.get("http://example.com")

            # place chunk into out queue and sleep for some time.
            self.out_queue.put(resp)
            time.sleep(10)


class ConsumerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # consume the data.
            chunk = self.out_queue.get()
            try:
                # check whether chunk has been consumed before. query the database.
                flag = query_database(chunk)
                if not flag:
                    do_something_with(chunk)

                    # persist the data and other info insert to the database.
                    data_persist()
                else:
                    print("data has been consumed before.")
            finally:
                # signal the queue that this item is done on every path;
                # otherwise out_queue.join() never returns for skipped items
                self.out_queue.task_done()


def main():

    # just one producer thread.
    t = ProducerThread(out_queue)
    t.daemon = True  # setDaemon() is deprecated in recent Python versions
    t.start()

    for i in range(3):
        ct = ConsumerThread(out_queue)
        ct.daemon = True
        ct.start()

    # wait on the queue until everything has been processed
    out_queue.join()

main()

Answer 1

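If the logs read from the remote server contain no duplicates, there is no need to check whether a log has been processed more than once: the Queue class implements all the required locking semantics, so Queue.get() guarantees that a given item is handed to only one ConsumerThread.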
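A minimal sketch of that guarantee, assuming Python 3 (where the module is named queue). Three consumer threads drain one queue and record what they received; the names handled and worker, the None sentinel, and the item count are made up for illustration.

```python
import queue
import threading

q = queue.Queue()
handled = []                      # every item taken by any consumer
handled_lock = threading.Lock()   # protects the shared list, not the queue


def worker():
    while True:
        item = q.get()            # a given item is returned to one thread only
        try:
            if item is None:      # sentinel: stop this worker
                return
            with handled_lock:
                handled.append(item)
        finally:
            q.task_done()


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(100):
    q.put(i)
for _ in threads:
    q.put(None)                   # one sentinel per worker

q.join()
for t in threads:
    t.join()

# Total items received and number of distinct items received.
print(len(handled), len(set(handled)))   # prints "100 100"
```

No item shows up twice, even though the consumers never coordinate with each other directly.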

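If the logs can contain duplicates (I guess not), then you should do the check in ProducerThread, before adding the log to the queue, rather than in ConsumerThread. That way you do not need to think about locking.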
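One way the producer-side check could look, as a rough sketch. It assumes that identical content means "the same log", which may not hold for your data; enqueue_if_new and seen are hypothetical names, not part of the question's code.

```python
import hashlib
import queue

out_queue = queue.Queue()
seen = set()   # touched only by the single producer thread, so no lock is needed


def enqueue_if_new(chunk: bytes) -> bool:
    """Put chunk on the queue unless an identical chunk was queued before."""
    key = hashlib.sha256(chunk).hexdigest()   # assumption: content identifies a log
    if key in seen:
        return False
    seen.add(key)
    out_queue.put(chunk)
    return True


results = [enqueue_if_new(c) for c in (b"log-1", b"log-2", b"log-1")]
print(results, out_queue.qsize())   # prints "[True, True, False] 2"
```

The set lives in memory only, so it is forgotten when the application restarts; a check that has to survive restarts would need to be backed by the database instead.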

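Update, following @dofine's confirmation in the comments below of my understanding of the requirements:

For points 2 and 3, you probably need a lightweight persistent queue, such as FifoDiskQueue from queuelib. To be honest, I have not used this lib before, but I think it should suit you. Please take a look at it.

For point 1, I guess you can achieve it by using any (non-in-memory) database together with a second FifoDiskQueue: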

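The second queue is used to immediately re-queue a log if a consumer thread fails to process it. See my first comment below for the idea.
There is one table in the db. The producer thread always adds new records to it but never updates any record; a consumer thread only updates the records it picked from the queue.
With the above logic, you never need to lock the table.
At application startup (before starting the consumers), you can have the producer query the database for logs that were "lost" in flight because the application terminated unexpectedly.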
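The table part of this idea can be sketched with the standard sqlite3 module. The schema, the done column, and the function names below are assumptions made for illustration, and the example uses an in-memory database only so that it runs anywhere; a real setup would use a file so the rows survive a restart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # demo only; use a file path in practice
conn.execute(
    "CREATE TABLE logs ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " result TEXT,"
    " done INTEGER NOT NULL DEFAULT 0)"
)


def producer_add(body):
    """Producer side: only ever INSERTs new rows."""
    with conn:   # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO logs (body) VALUES (?)", (body,))
    return cur.lastrowid


def consumer_finish(log_id, result):
    """Consumer side: only UPDATEs the row it took from the queue."""
    with conn:
        conn.execute(
            "UPDATE logs SET result = ?, done = 1 WHERE id = ?", (result, log_id)
        )


def unfinished_logs():
    """Startup recovery: rows that were inserted but never marked done."""
    return conn.execute(
        "SELECT id, body FROM logs WHERE done = 0 ORDER BY id"
    ).fetchall()


first = producer_add("log A")
second = producer_add("log B")
consumer_finish(first, "ok")

# Pretend the application died here: "log B" was queued but never finished.
print(unfinished_logs())   # prints "[(2, 'log B')]"
```

In a threaded program each thread would normally open its own connection, since a sqlite3 connection by default may only be used from the thread that created it. SQLite also serializes writers itself, so concurrent consumers may still wait briefly on each other's commits even though no two of them ever touch the same row.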

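This update was typed in the mobile SO app, so expanding on it is a bit inconvenient. If needed, I will update it again when I get the chance.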

Python producer/consumer with data persistence in a database?
