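As far as I understand, a luigi.Target can either exist or not. Therefore, if a luigi.Target exists, it will not be recomputed.
I'm looking for a way to force recomputation of a task if one of its dependencies has been modified, or if the code of one of the tasks has changed.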
Answer 0 (score: 19)
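You can achieve this by overriding the complete(...) method.
The documentation for complete is straightforward: implement a function that checks your constraints and returns False whenever the task should be recomputed.
For example, to force recomputation when a dependency has been updated, you could do the following: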

    def complete(self):
        """Flag this task as incomplete if any requirement is incomplete
        or has been updated more recently than this task"""
        import os
        import time

        def mtime(path):
            # NOTE: time.ctime returns a string, so the comparison below is
            # lexicographic; a later answer returns os.path.getmtime(path)
            # directly to compare floats instead.
            return time.ctime(os.path.getmtime(path))

        # assuming 1 output
        if not os.path.exists(self.output().path):
            return False

        self_mtime = mtime(self.output().path)

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in self.requires():
            if not el.complete():
                return False
            for output in el.output():
                if mtime(output.path) > self_mtime:
                    return False

        return True

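This returns False when any requirement is incomplete, when any requirement has been modified more recently than the current task's output, or when the current task's output does not exist.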
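Detecting when the code has changed is harder. You could use a similar scheme (checking mtime), but it will be hit-or-miss unless every task lives in its own file.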
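A minimal sketch of that mtime scheme applied to source files, assuming you already know which files hold a task's code. The helper name outputs_stale and its parameters are hypothetical, not part of Luigi:

```python
import os


def outputs_stale(output_paths, source_paths):
    """Return True if any output is missing or older than any source file."""
    if not all(os.path.exists(p) for p in output_paths):
        return True
    # getmtime returns float seconds since the epoch, so numeric comparison is safe.
    oldest_output = min(
        (os.path.getmtime(p) for p in output_paths), default=float("inf")
    )
    newest_source = max(
        (os.path.getmtime(p) for p in source_paths), default=float("-inf")
    )
    return newest_source > oldest_output
```

Inside a task's complete(), the source path could be looked up with inspect.getsourcefile(type(self)). That only notices edits to the file defining the task class, not to modules it imports, which is why the approach stays hit-or-miss when several tasks share one file.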
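Because you can override complete, you can implement whatever recomputation logic you need. If you want a particular complete method for many tasks, I'd recommend subclassing luigi.Task, implementing your custom complete there, and then deriving your tasks from that subclass.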
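The subclassing pattern might look roughly like the sketch below. To keep it runnable without Luigi installed, Task and FileTarget are hypothetical stand-ins for luigi.Task and luigi.LocalTarget; MTimeTask, Numbers and Squares are illustrative names, and each task is assumed to have exactly one output:

```python
import os


class FileTarget:
    """Hypothetical stand-in for luigi.LocalTarget: only holds a path."""

    def __init__(self, path):
        self.path = path


class Task:
    """Hypothetical stand-in for luigi.Task."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def complete(self):
        # Behaviour described in the question: complete iff the output exists.
        return os.path.exists(self.output().path)


class MTimeTask(Task):
    """Shared base class whose complete() also compares modification times."""

    def complete(self):
        out_path = self.output().path
        if not os.path.exists(out_path):
            return False
        out_mtime = os.path.getmtime(out_path)  # float, compared numerically
        for req in self.requires():
            if not req.complete():
                return False
            if os.path.getmtime(req.output().path) > out_mtime:
                return False
        return True


class Numbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return FileTarget(os.path.join(self.workdir, "numbers.txt"))


class Squares(MTimeTask):
    """Inherits the mtime-aware complete() without redefining it."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [Numbers(self.workdir)]

    def output(self):
        return FileTarget(os.path.join(self.workdir, "squares.txt"))
```

With real Luigi classes the same shape should apply: put complete() in one intermediate class and let every task that needs the behaviour inherit from it.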
Answer 1 (score: 3)
I'm late to the game, but here is a mixin that improves on the accepted answer by supporting multiple input/output files.

    import os
    import time


    class MTimeMixin:
        """
        Mixin that flags a task as incomplete if any requirement
        is incomplete or has been updated more recently than this task
        This is based on http://stackoverflow.com/a/29304506, but extends
        it to support multiple input / output dependencies.
        """

        def complete(self):
            def to_list(obj):
                # Wrap a single target or task so it can be iterated over.
                if isinstance(obj, (tuple, list)):
                    return obj
                else:
                    return [obj]

            def mtime(path):
                # NOTE: time.ctime returns a string, so the comparisons below
                # are lexicographic; a later answer returns
                # os.path.getmtime(path) directly to compare floats instead.
                return time.ctime(os.path.getmtime(path))

            if not all(os.path.exists(out.path) for out in to_list(self.output())):
                return False

            self_mtime = min(mtime(out.path) for out in to_list(self.output()))

            # the below assumes a list of requirements, each with a list of outputs. YMMV
            for el in to_list(self.requires()):
                if not el.complete():
                    return False
                for output in to_list(el.output()):
                    if mtime(output.path) > self_mtime:
                        return False

            return True

To use it, declare your class with the mixin listed first, for example class MyTask(MTimeMixin, luigi.Task).
Answer 2 (score: 2)
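The code above works well for me, except that I believe mtime(path) must return a float rather than a string for the timestamp comparison to be correct ("Thu" > "Mon" ... [sic]). So simply use:

    def mtime(path):
        return os.path.getmtime(path)

instead of:

    def mtime(path):
        return time.ctime(os.path.getmtime(path))
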
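A small illustration of why the string form misorders timestamps. The two epoch values are arbitrary examples, picked so that the weekday names sort the wrong way round in any local time zone:

```python
import time

tuesday = 1704196800  # 2024-01-02 12:00:00 UTC
friday = 1704456000   # 2024-01-05 12:00:00 UTC, three days later

# Comparing the raw floats/ints gives the chronological answer.
as_number = friday > tuesday

# time.ctime yields strings such as "Fri Jan  5 12:00:00 2024", so the
# comparison starts with the weekday name: "Fri"/"Sat" sorts before "Tue"/"Wed".
as_string = time.ctime(friday) > time.ctime(tuesday)

print(as_number, as_string)  # prints: True False
```

The later file looks older under string comparison, so a task would wrongly be reported as complete.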
Answer 3 (score: 0)
Regarding the Mixin suggested by Shilad Sen in another answer, consider the following example:

    # Filename: run_luigi.py
    import luigi
    from MTimeMixin import MTimeMixin


    class PrintNumbers(luigi.Task):

        def requires(self):
            return []

        def output(self):
            return luigi.LocalTarget("numbers_up_to_10.txt")

        def run(self):
            with self.output().open('w') as f:
                for i in range(1, 11):
                    f.write("{}\n".format(i))


    class SquaredNumbers(MTimeMixin, luigi.Task):

        def requires(self):
            return [PrintNumbers()]

        def output(self):
            return luigi.LocalTarget("squares.txt")

        def run(self):
            with self.input()[0].open() as fin, self.output().open('w') as fout:
                for line in fin:
                    n = int(line.strip())
                    out = n * n
                    fout.write("{}:{}\n".format(n, out))


    if __name__ == '__main__':
        luigi.run()

where MTimeMixin is the same as in the post above. I run the task once using

    luigi --module run_luigi SquaredNumbers

Then I touch the file numbers_up_to_10.txt and run the task again. Luigi then complains as follows:
File "c:\winpython-64bit-3.4.4.6qt5\python-3.4.4.amd64\lib\site-packages\luigi-2.7.1-py3.4.egg\luigi\local_target.py", line 40, in move_to_final_destination
os.rename(self.tmp_path, self.path)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'squares.txt-luigi-tmp-5391104487' -> 'squares.txt'
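This is probably a Windows-only problem and not an issue on Linux, where "mv a b" may simply delete the old b (if it already exists and is not write-protected). We can fix it with the following patch to luigi/local_target.py: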
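
    def move_to_final_destination(self):
        # Move any existing output aside under a timestamped name first.
        # Uses time.strftime, so the module needs `import time`.
        if os.path.exists(self.path):
            os.rename(self.path, self.path + time.strftime("_%Y%m%d%H%M%S.txt"))
        os.rename(self.tmp_path, self.path)
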
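If keeping a timestamped backup is not needed, another option may be os.replace (Python 3.3+), which overwrites an existing destination file on Windows as well as on POSIX systems. The sketch below is a standalone function with hypothetical file names, not Luigi's actual method:

```python
import os
import tempfile


def move_to_final_destination(tmp_path, final_path):
    """Move tmp_path onto final_path, replacing final_path if it exists."""
    # os.rename raises FileExistsError on Windows when the destination
    # exists; os.replace overwrites it on both Windows and POSIX.
    os.replace(tmp_path, final_path)


workdir = tempfile.mkdtemp()
final_path = os.path.join(workdir, "squares.txt")
tmp_path = final_path + "-luigi-tmp-demo"  # hypothetical temp file name

with open(final_path, "w") as f:
    f.write("old\n")
with open(tmp_path, "w") as f:
    f.write("new\n")

move_to_final_destination(tmp_path, final_path)

with open(final_path) as f:
    result = f.read()
```

Unlike the patch above, this discards the previous output instead of renaming it.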
Also, for completeness, here is the Mixin again as a separate file, taken from the other post:

    # Filename: MTimeMixin.py
    import os


    class MTimeMixin:
        """
        Mixin that flags a task as incomplete if any requirement
        is incomplete or has been updated more recently than this task
        This is based on http://stackoverflow.com/a/29304506, but extends
        it to support multiple input / output dependencies.
        """

        def complete(self):
            def to_list(obj):
                # Wrap a single target or task so it can be iterated over.
                if isinstance(obj, (tuple, list)):
                    return obj
                else:
                    return [obj]

            def mtime(path):
                # Float seconds since the epoch, so comparisons are numeric.
                return os.path.getmtime(path)

            if not all(os.path.exists(out.path) for out in to_list(self.output())):
                return False

            self_mtime = min(mtime(out.path) for out in to_list(self.output()))

            # the below assumes a list of requirements, each with a list of outputs. YMMV
            for el in to_list(self.requires()):
                if not el.complete():
                    return False
                for output in to_list(el.output()):
                    if mtime(output.path) > self_mtime:
                        return False

            return True