Luigi - Overriding Task requires/input

Time: 2018-04-23 18:22:52

Tags: python pipeline luigi

I am using luigi to execute a chain of tasks, like so:

import luigi


class Task1(luigi.Task):
    stuff = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('test.json')

    def run(self):
        with self.output().open('w') as f:
            f.write(self.stuff)


class Task2(luigi.Task):
    stuff = luigi.Parameter()

    def requires(self):
        return Task1(stuff=self.stuff)

    def output(self):
        return luigi.LocalTarget('something-else.json')

    def run(self):
        with self.output().open('w') as f:
            f.write(self.stuff)

This works exactly as desired when I start the entire workflow like this:

luigi.build([Task2(stuff='stuff')])

When using luigi.build, you can also run multiple tasks by passing arguments explicitly, as per this example in the documentation.

However, in my case I would also like to be able to run the business logic of Task2 completely independently of the workflow. This works fine for tasks that do not implement requires, as per this example.

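My question is: how can I run this method both as part of the workflow and on its own? Obviously, I could add a new private method such as run_standalone(), which takes the data and returns the result, and then call it from run, but it feels like something that should be baked into the framework, which makes me think I am misunderstanding Luigi's best practices (I am still learning the framework). Any advice is appreciated, thanks!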

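1 Answer:

Answer 0: (score: 1)

It sounds like what you want is dynamic requirements. Using the pattern shown in that example, you can read a config file or pass a parameter carrying arbitrary data, and yield only the tasks you want to execute based on the fields in that config.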

# tasks.py
import luigi
import json
import time


class Parameterizer(luigi.Task):
    params = luigi.Parameter() # Arbitrary JSON

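    def output(self):
        # Fixed path: once config.json exists, Luigi treats this task as complete
        return luigi.LocalTarget('./config.json')

    def run(self):
        with self.output().open('w') as f:
            # params is a JSON string; parse it so the file holds a JSON object
            json.dump(json.loads(self.params), f)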

class Task1(luigi.Task):
    stuff = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('{}'.format(self.stuff[:6]))

    def run(self):
        with self.output().open('w') as f:
            f.write(self.stuff)


class Task2(luigi.Task):
    stuff = luigi.Parameter()
    params = luigi.Parameter()


    def output(self):
        return luigi.LocalTarget('{}'.format(self.stuff[6:]))

    def run(self):

        config = Parameterizer(params=self.params)
        yield config

        with config.output().open() as f:
            parameters = json.load(f)

        if parameters["runTask1"]:
            yield Task1(stuff=self.stuff)
        else:
            pass
        with self.output().open('w') as f:
            f.write(self.stuff)

if __name__ == '__main__':
    cf_json = '{"runTask1": true}'  # JSON booleans are lowercase

    print("Trying to run with Task1...")
    luigi.build([Task2(stuff="Task 1Task 2", params=cf_json)], local_scheduler=True)

    time.sleep(10)

    cf_json = '{"runTask1": false}'  # JSON booleans are lowercase

    print("Trying to run WITHOUT Task1...")
    luigi.build([Task2(stuff="Task 1Did just task 2", params=cf_json)], local_scheduler=True)

(Just run python tasks.py to execute it.)
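One caveat with the script above: Parameterizer always writes to ./config.json, and Luigi's default completeness check treats a task as done once its output exists, so a second build reuses whatever config the first one wrote. A minimal sketch of one way around this, assuming a hypothetical helper `config_path` that derives the file name from the parameter contents:

```python
import hashlib


def config_path(params):
    """Return a config file path that changes whenever the params string changes.

    Hypothetical helper, not part of Luigi: it hashes the raw JSON string so
    that each distinct configuration gets its own output file.
    """
    digest = hashlib.sha1(params.encode("utf-8")).hexdigest()[:10]
    return "./config-{}.json".format(digest)
```

Parameterizer.output() could then return `luigi.LocalTarget(config_path(self.params))`, so each distinct params value is written and read separately.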

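It is easy to imagine mapping multiple parameters to multiple tasks, or applying custom checks before allowing various tasks to run. We could also rewrite this so that the parameters come from a luigi.Config.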
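The flag-to-task mapping could be kept in a plain dictionary, with the selection logic separated from Luigi so it can be tested on its own. The sketch below is an assumption about how one might structure it: `FLAG_TO_TASK`, `runCleanup`, `CleanupTask`, and `select_tasks` are hypothetical names, and the registry holds strings here only to keep the example free of dependencies.

```python
import json

# Hypothetical registry: config flag -> task to schedule when the flag is set.
# In a real pipeline the values would be luigi.Task subclasses, not strings.
FLAG_TO_TASK = {
    "runTask1": "Task1",
    "runCleanup": "CleanupTask",
}


def select_tasks(params_json, registry):
    """Return the registry entries whose flag is truthy in the JSON config."""
    config = json.loads(params_json)
    # Missing flags are treated as disabled.
    return [task for flag, task in registry.items() if config.get(flag)]
```

Inside Task2.run(), the result could be turned into dependencies with something like `yield [cls(stuff=self.stuff) for cls in select_tasks(self.params, registry)]`, assuming every registered task accepts a stuff parameter.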

Also note the following control flow from Task2:

    if parameters["runTask1"]:
        yield Task1(stuff=self.stuff)
    else:
        pass

Here we could run an alternative task instead, or schedule tasks dynamically, as seen in the example from the luigi repo. For example:

    if parameters["runTask1"]:
        yield Task1(stuff=self.stuff)
    else:
        # self.stuff is a string (luigi.Parameter does not parse it to int),
        # so this comprehension creates one Task1 per character
        data_dependent_deps = [Task1(stuff=x) for x in self.stuff]
        yield data_dependent_deps

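This is probably a bit more involved than a simple run_standalone() method, but I think it is the closest match to what you are looking for among the documented luigi patterns.

Source: https://luigi.readthedocs.io/en/stable/tasks.html?highlight=dynamic#dynamic-dependencies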