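I want to use Luigi to define a workflow with two "stages":
So I first subclassed `luigi.contrib.postgres.PostgresQuery` and overrode `host`, `database`, `user`, etc. as described in the docs.
After that, how do I pass the query result to the next task in the workflow? That next task already declares, in its `requires` method, that the class above must be instantiated and returned.
My code: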
    import luigi
    import luigi.contrib.postgres

    class MyData(luigi.contrib.postgres.PostgresQuery):
        host = 'my_host'
        database = 'my_db'
        user = 'my_user'
        password = 'my_pass'
        table = 'my_table'
        query = 'select * from my_table'

    class DoWhateverWithMyData(luigi.Task):
        def requires(self):
            # Declares MyData as the upstream dependency of this task.
            return MyData()
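What else is needed?
Thanks!
Edit 1
Looking at Luigi's code, it seems that the `run` method of `PostgresQuery` does nothing with the query result; I mean, the query is executed and that's all: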
    class PostgresQuery(rdbms.Query):
        """
        Template task for querying a Postgres compatible database

        Usage:
        Subclass and override the required `host`, `database`, `user`, `password`, `table`, and `query` attributes.
        Optionally one can override the `autocommit` attribute to put the connection for the query in autocommit mode.
        Override the `run` method if your use case requires some action with the query result.
        Task instances require a dynamic `update_id`, e.g. via parameter(s), otherwise the query will only execute once
        To customize the query signature as recorded in the database marker table, override the `update_id` property.
        """

        def run(self):
            connection = self.output().connect()
            connection.autocommit = self.autocommit
            cursor = connection.cursor()
            sql = self.query

            logger.info('Executing query from task: {name}'.format(name=self.__class__))
            cursor.execute(sql)

            # Update marker table
            self.output().touch(connection)

            # commit and close connection
            connection.commit()
            connection.close()

        def output(self):
            """
            Returns a PostgresTarget representing the executed query.
            Normally you don't override this.
            """
            return PostgresTarget(
                host=self.host,
                database=self.database,
                user=self.user,
                password=self.password,
                table=self.table,
                update_id=self.update_id
            )
I think I have to extend this class with my own implementation.
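As far as I can tell, the part the stock `run` is missing is reading the rows back from the cursor and storing them somewhere a downstream task can open. Below is a minimal sketch of just that step, written against the generic DB-API cursor interface. The helper name `dump_query_to_csv` and the CSV format are my own assumptions, not anything Luigi provides, and `sqlite3` is only a stand-in so the snippet can be tried without a Postgres server.

```python
import csv
import io
import sqlite3


def dump_query_to_csv(connection, sql, out_file):
    """Run `sql` on a DB-API connection and write the rows to `out_file` as CSV.

    Hypothetical helper (not part of Luigi). Returns the number of data rows
    written; the header line is not counted.
    """
    cursor = connection.cursor()
    cursor.execute(sql)
    writer = csv.writer(out_file, lineterminator="\n")
    # cursor.description has one entry per column; item 0 is the column name.
    writer.writerow([column[0] for column in cursor.description])
    row_count = 0
    for row in cursor:
        writer.writerow(row)
        row_count += 1
    return row_count


if __name__ == "__main__":
    # sqlite3 is an assumption for illustration; psycopg2 cursors expose the
    # same execute/description/iteration interface.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, "a"), (2, "b")])
    buffer = io.StringIO()
    written = dump_query_to_csv(conn, "SELECT * FROM my_table ORDER BY id", buffer)
    print(written)
    print(buffer.getvalue())
    conn.close()
```

In an overridden `run`, the same steps would presumably sit between `cursor.execute(sql)` and `self.output().touch(connection)`, with `out_file` opened from a file-based target (for example `luigi.LocalTarget(path).open('w')`, where `path` is a hypothetical location). The downstream task would then read that file rather than the `PostgresTarget`, which only records that the query ran.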
Edit 2
I found this link, which explains the same thing as the edit above.