Question

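I have a large number of documents to process and am using Python RQ to parallelize the task.

I would like a pipeline of operations to be performed on each document. For example, A -> B -> C means: pass the document to function A, once A is done proceed to B, and finish with C.

However, Python RQ does not seem to support pipelines very well.

Here is a simple but somewhat dirty way of doing it. In short, each function along the pipeline calls the next function in a nested fashion.

For example, for the pipeline A -> B -> C.

At the top level, some code is written like this: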

q.enqueue(A, the_doc)

where q is the Queue instance, and function A contains code like this:

q.enqueue(B, the_doc)

And in B there is something similar:

q.enqueue(C, the_doc)

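Is there a more elegant way than this? For example, some code in ONE function:

q.enqueue(A, the_doc)
q.enqueue(B, the_doc, after=A)
q.enqueue(C, the_doc, after=B)

The depends_on parameter is the closest match to my requirement, but running something like:

A_job = q.enqueue(A, the_doc)
q.enqueue(B, depends_on=A_job)

won't work, because q.enqueue(B, depends_on=A_job) is executed immediately after A_job = q.enqueue(A, the_doc). By the time B is enqueued, the result from A might not be ready, since A takes time to process.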

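PS:

If Python RQ is not really good at this, what other tool in Python can I use to achieve the same purpose:

1. round-robin parallelization
2. pipeline processing support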

Answer 1

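"By the time B is enqueued, the result from A might not be ready, since A takes time to process."

I'm not sure whether this was actually true when you originally posted the question, but in any case it is not true now. In fact, the depends_on feature is made exactly for the workflow you describe.

It is true that these two calls are executed immediately one after the other.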

A_job = q.enqueue(A, the_doc)
B_job = q.enqueue(B, depends_on=A_job )

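But the worker will not execute B until A has finished. Until A_job has executed successfully, B_job.get_status() returns 'deferred'. Once A_job.get_status() returns 'finished', B will start to run.

This means that B and C can access and operate on the results of their dependencies, like this: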

import time
from rq import Queue, get_current_job
from redis import StrictRedis

conn = StrictRedis()
q = Queue('high', connection=conn)

def A():
    time.sleep(100)
    return 'result A'

def B():
    time.sleep(100)
    current_job = get_current_job(conn)
    a_job_id = current_job.dependencies[0].id
    a_job_result = q.fetch_job(a_job_id).result
    assert a_job_result == 'result A'
    return a_job_result + ' result B'


def C():
    time.sleep(100)
    current_job = get_current_job(conn)
    b_job_id = current_job.dependencies[0].id
    b_job_result = q.fetch_job(b_job_id).result
    assert b_job_result == 'result A result B'
    return b_job_result + ' result C'

Eventually the worker will print 'result A result B result C'.
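The functions above still have to be enqueued so that each job depends on the one before it. The following is a minimal sketch of that step, not something RQ provides: enqueue_chain and RecordingQueue are hypothetical names introduced here, and the only assumption about the queue is that enqueue(func, depends_on=job) returns a job object, as rq.Queue does. RecordingQueue is a stand-in so the sketch runs without Redis.

```python
from typing import Any, Callable, List, Sequence


def enqueue_chain(queue: Any, funcs: Sequence[Callable], **job_options: Any) -> List[Any]:
    """Enqueue funcs in order, making each job depend on the previous one.

    Assumption: `queue` behaves like rq.Queue, i.e. queue.enqueue(func,
    depends_on=job, ...) returns a job object. This helper is hypothetical
    and not part of RQ.
    """
    jobs: List[Any] = []
    for func in funcs:
        if jobs:
            # With a real rq.Queue this job stays deferred until jobs[-1] finishes.
            job = queue.enqueue(func, depends_on=jobs[-1], **job_options)
        else:
            job = queue.enqueue(func, **job_options)
        jobs.append(job)
    return jobs


class RecordingQueue:
    """Stand-in for rq.Queue (illustration only): records what was enqueued."""

    def __init__(self) -> None:
        self.calls = []

    def enqueue(self, func, **kwargs):
        job = "job-%d" % len(self.calls)  # a real queue returns an rq Job here
        self.calls.append((func.__name__, kwargs.get("depends_on")))
        return job


def A():
    return "result A"


def B():
    return "result B"


def C():
    return "result C"


demo_queue = RecordingQueue()
demo_jobs = enqueue_chain(demo_queue, [A, B, C])
print(demo_queue.calls)  # [('A', None), ('B', 'job-0'), ('C', 'job-1')]
```

With the real queue and functions from this answer, the call would presumably be enqueue_chain(q, [A, B, C]), optionally with result_ttl=... passed through as a keyword argument.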

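Also, if you have many jobs in the queue and B may wait a while before it is executed, you may want to increase result_ttl significantly, or make it indefinite with result_ttl=-1. Otherwise the result of A will be purged once the number of seconds set for result_ttl has elapsed, in which case B will no longer be able to access it and return the desired result.

However, setting result_ttl=-1 has important memory implications. It means the results of your jobs are never purged automatically, and memory use will keep growing with the number of stored results until you remove them from Redis manually.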
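One way to avoid result_ttl=-1 is to pick a finite value that comfortably covers the time a dependent job might spend waiting. The sketch below is only an illustration of that idea: finite_result_ttl and its safety_factor are hypothetical, the multiplier is an arbitrary assumption, and 500 seconds is RQ's documented default result TTL.

```python
RQ_DEFAULT_RESULT_TTL = 500  # seconds; RQ's documented default


def finite_result_ttl(expected_wait_seconds: float, safety_factor: float = 3) -> int:
    """Return a finite result_ttl (seconds) covering the expected queue wait.

    Hypothetical heuristic: keep results for a multiple of the longest time a
    dependent job is expected to wait, and never less than RQ's default.
    """
    return max(RQ_DEFAULT_RESULT_TTL, int(expected_wait_seconds * safety_factor))


# Example: dependents may wait up to two hours before they run.
ttl = finite_result_ttl(2 * 60 * 60)
print(ttl)  # 21600
```

The value would then be passed when enqueueing the job whose result must survive, for example q.enqueue(A, result_ttl=ttl) with the queue from this answer.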

Answer 2

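"The depends_on parameter is the closest match to my requirement, but running something like:

A_job = q.enqueue(A, the_doc)
q.enqueue(B, depends_on=A_job)

won't work, because q.enqueue(B, depends_on=A_job) is executed immediately after A_job = q.enqueue(A, the_doc). By the time B is enqueued, the result from A might not be ready, since A takes time to process."

In this case, the job created by q.enqueue(B, depends_on=A_job) will run only after A_job has finished. If the result is not ready yet, B will wait until it is.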

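RQ does not support this out of the box, but it is possible using other techniques.

In my case, I used a cache to keep track of the previous job in the chain, so that when we want to enqueue a new function (to run right after it), we can set its 'depends_on' argument correctly when calling enqueue().

Note the additional arguments passed to enqueue: 'timeout, result_ttl, ttl'. I set them because I was running long jobs on rq. You can look them up in the python rq docs (recent RQ releases name the first one job_timeout).

I used django_rq.enqueue(), which is derived from python rq.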

    # main.py
    import django_rq
    from django.core.cache import cache  # assumed: Django's cache framework

    def process_job():
        ...

        # Create a cache key for every chain of methods you want to call.
        # NOTE: I used this for web development, in your case you may want
        # to use a variable or a database, not caching

        # Number of seconds to cache and keep the results in rq
        TWO_HRS = 60 * 60 * 2

        cache_key = 'update-data-key-%s' % obj.id
        previous_job_id = cache.get(cache_key)
        job = django_rq.enqueue(update_metadata,
                                campaign=campaign,
                                list=chosen_list,
                                depends_on=previous_job_id,
                                timeout=TWO_HRS,
                                result_ttl=TWO_HRS,
                                ttl=TWO_HRS)

        # Remember the most recently enqueued job, so the next function
        # in the chain can set the proper value for 'depends_on'
        cache.set(cache_key, job.id, TWO_HRS)

    # utils.py
    def update_metadata(campaign, list):
        # Your code goes here to update the campaign object with the list object
        pass

'depends_on', from the rq docs:

depends_on: specifies another job (or job id) that must complete before this job will be queued

Python RQ: pattern for callback
