Celery: accessing all previous results in a chain

Date: 2015-04-09 19:52:10

Tags: python redis celery

So basically I have a fairly complex workflow that looks something like this:

>>> res = (add.si(2, 2) | add.s(4) | add.s(8))()
>>> res.get()
16

Afterwards, it's fairly trivial for me to walk up the result chain and collect all the individual results:

>>> res.parent.get()
8

>>> res.parent.parent.get()
4
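
In loop form, for instance, the whole collection is just a walk over the parent pointers (using the standard AsyncResult.parent attribute):

>>> node = res
>>> while node is not None:
...     print(node.get())   # prints 16, then 8, then 4 (last task back to first)
...     node = node.parent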

My question is: what if my third task needs to know the result of the first one, but in this example it only receives the result of the second?

Also, the chain is quite long and the results are not small, so just passing them along as input would needlessly pollute the result store. That store is Redis, so the limitations that apply when using RabbitMQ, ZeroMQ, ... do not apply here.

3 Answers:

Answer 0 (score: 2)

I assign each chain a job id and keep track of that job by saving its data in a database.

Launching the queue:

import uuid

from tasks import add, clean  # assuming the tasks below live in tasks.py

if __name__ == "__main__":
    # Generate a unique id for the job
    job_id = uuid.uuid4().hex
    # This is the root parent
    parent_level = 1
    # Pack the data. The last value is your value to add
    parameters = job_id, parent_level, 2
    # Build the chain. I added a clean task that removes the data
    # created during the process (if you want that)
    add_chain = add.s(parameters, 2) | add.s(4) | add.s(8) | clean.s()
    add_chain.apply_async()

Now the tasks:

import inject

# EntityManager and Result are defined further below; app is your Celery app

# Helper to store a partial result. I used SQLAlchemy (MySQL) but you can
# change it to whatever you want (a distributed file system, for example)
@inject.params(entity_manager=EntityManager)
def save_result(job_id, level, result, entity_manager):
    r = Result()
    r.job_id = job_id
    r.level = level
    r.result = result
    entity_manager.add(r)
    entity_manager.commit()

# Restore the result saved by one parent
@inject.params(entity_manager=EntityManager)
def get_result(job_id, level, entity_manager):
    result = entity_manager.query(Result).filter_by(job_id=job_id, level=level).one()
    return result.result

# Clear the data, or do something with the final result first
@inject.params(entity_manager=EntityManager)
def clear(job_id, entity_manager):
    entity_manager.query(Result).filter_by(job_id=job_id).delete()
    entity_manager.commit()

@app.task()
def add(parameters, number):
    # Unpack the data from the parameters tuple
    job_id, level, other_number = parameters

    # Load the result from your second parent (level - 2);
    # for the third parent use level - 3, and so on
    #second_parent_result = get_result(job_id, level - 2)

    # Do your stuff; I guess you want to add numbers
    result = number + other_number
    save_result(job_id, level, result)

    # You have to return something, because the next "add" in the chain
    # expects the three packed values: return the actual job_id, the
    # incremented parent level and the result of the sum
    return job_id, level + 1, result

@app.task()
def clean(parameters):
    job_id, level, result = parameters
    # Do something with the final result (or not), then clear the data
    clear(job_id)

I use an entity_manager to handle the database operations. My entity manager uses SQLAlchemy and MySQL, and I use a table "result" to store the partial results. This part should be swapped for whatever storage system suits you best (or kept as-is if MySQL works for you):

import os

import inject
import yaml
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# Configuration is defined first because EntityManager's decorator
# references it at class-definition time
class Configuration:
  def __init__(self, params=None):
    f = open(os.environ.get('PYTHONPATH') + '/conf/config.yml')
    self.configMap = yaml.safe_load(f)
    f.close()

  def __getitem__(self, key: str):
    return self.configMap[key]

class EntityManager:

  session = None

  @inject.params(config=Configuration)
  def __init__(self, config):
    conf = config['persistence']
    uri = conf['driver'] + "://" + conf['username'] + ":@" + conf['host'] + "/" + conf['database']

    engine = create_engine(uri, echo=conf['debug'])

    Session = sessionmaker(bind=engine)
    self.session = Session()

  def query(self, entity_type):
    return self.session.query(entity_type)

  def add(self, entity):
    return self.session.add(entity)

  def flush(self):
    return self.session.flush()

  def commit(self):
    return self.session.commit()

class Result(Base):
  __tablename__ = 'result'

  id = Column(Integer, primary_key=True)
  job_id = Column(String(255))
  level = Column(Integer)
  result = Column(Integer)

  def __repr__(self):
    return "<Result (job='%s', level='%s', result='%s')>" % (self.job_id, str(self.level), str(self.result))

I used the inject package to get a dependency injector. The inject package reuses the object, so you can inject access to the database wherever you need it without worrying about the connection.
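
The injector still has to be configured once at process startup; with the inject package that looks roughly like this (a sketch, assuming python-inject's configure / bind_to_constructor API):

import inject

def bind_dependencies(binder):
    # Create each dependency once, on first use; inject then reuses the
    # same instance for every @inject.params decorated function
    binder.bind_to_constructor(Configuration, Configuration)
    binder.bind_to_constructor(EntityManager, EntityManager)

inject.configure(bind_dependencies)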

The Configuration class loads the database access data from a configuration file. You can replace it with static data (a hard-coded map) for testing.
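
For example, a hard-coded stand-in for testing could look like this (a sketch; the keys mirror the ones EntityManager reads above):

class StaticConfiguration:
    # Drop-in replacement for Configuration with hard-coded test data
    configMap = {
        'persistence': {
            'driver': 'mysql',
            'username': 'test',
            'host': 'localhost',
            'database': 'jobs_test',
            'debug': True,
        }
    }

    def __getitem__(self, key):
        return self.configMap[key]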

Change the dependency injection to whatever else suits you. This is just my solution; I only put it together for a quick test.

The key here is to store the partial results outside of our queue system and, from the tasks, return the data needed to access those results (the job_id and the parent level). You end up sending this extra (but small) piece of data, which is an address (job_id + parent level) pointing to the real data (something big).
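
For the original question, a third task that needs the first task's result would simply uncomment the get_result call; a hypothetical variant of the add task above:

@app.task()
def add_with_grandparent(parameters, number):
    job_id, level, other_number = parameters
    # The first task in the chain saved its result two levels up
    first_result = get_result(job_id, level - 2)
    result = number + other_number + first_result
    save_result(job_id, level, result)
    return job_id, level + 1, result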

I use this solution in my own software.

Answer 1 (score: 1)

A simple workaround is to store the task results in a list and use them inside your tasks.

from celery import Celery, chain
from celery.signals import task_success

results = []
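# NOTE: this list is filled per worker process, so a task only sees
# results from tasks that ran in the same worker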

app = Celery('tasks', backend='amqp', broker='amqp://')


@task_success.connect()
def store_result(**kwargs):
    sender = kwargs.pop('sender')
    result = kwargs.pop('result')
    results.append((sender.name, result))


@app.task
def add(x, y):
    print("previous results", results)
    return x + y

Now, from any task in your chain, you can access all the previous results, in any order.
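
For example, a task that needs the very first add's result could fish it back out of the list (a sketch; the 'tasks.add' lookup key assumes the module above is named tasks.py):

@app.task
def add_with_first(x, y):
    # Grab the result recorded for the earliest 'tasks.add' success
    first_result = next(r for name, r in results if name == 'tasks.add')
    return x + y + first_result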

Answer 2 (score: 1)

This may be overly complicated for your setup, but I like to use a group combined with a noop task to accomplish something similar. I do it this way because I want to highlight the areas of my pipeline that are still synchronous (often so that they can be removed later).

Using something similar to your example, I start with a set of tasks that look like this:

tasks.py

from celery import Celery

app = Celery('tasks', backend="redis", broker='redis://localhost')


@app.task
def add(x, y):
    return x + y


@app.task
def xsum(elements):
    return sum(elements)


@app.task
def noop(ignored):
    return ignored

With these tasks, I then create a chain that uses a group to control the results that depend on a synchronous result:

In [1]: from tasks import add, xsum, noop
In [2]: from celery import group

# First I run the task whose value I need later, then I send that result to
# a group where the first task does nothing and the remaining tasks are my
# pipeline. (The ~ prefix runs the signature and waits for its result.)
In [3]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)))
Out[3]: [4, 16]

# At this point I have a list where the first element is the result of my
# original task and the second element is the result of my workflow.
In [4]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s())
Out[4]: 20

# From here, things can go back to a normal chain
In [5]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s() | add.s(1) | add.s(1))
Out[5]: 22

I hope this is useful!