Airflow HTTP callback sensor

Date: 2018-07-27 21:57:46

Tags: callback airflow

Our Airflow jobs make HTTP requests to services that carry out the actual work. We would like those services to notify Airflow when they finish, so we plan to send them a callback URL to invoke on completion. However, I can't seem to find a callback sensor. How do people normally handle this?

1 answer:

Answer 0: (score: 4)

There is no such thing as a callback or webhook sensor in Airflow. The definition of a sensor, taken from the documentation:

Sensors are a certain type of operator that will keep running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns True.

In other words, a sensor is an operator that performs polling behaviour against an external system. In that sense, your external services should have a way of keeping the state of each executed task, internally or externally, so that a polling sensor can check on that state.
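
To make this concrete, a bare-bones custom sensor wrapping such a status check could look like the sketch below. It assumes Airflow 1.10-style imports; the status URL and the {"state": ...} response shape are hypothetical stand-ins for whatever your service exposes.

import requests

from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class ExternalTaskStatusSensor(BaseSensorOperator):
    """Polls a (hypothetical) status endpoint of the external service."""

    @apply_defaults
    def __init__(self, status_url, *args, **kwargs):
        super(ExternalTaskStatusSensor, self).__init__(*args, **kwargs)
        self.status_url = status_url

    def poke(self, context):
        # poke() is re-run every poke_interval seconds until it returns True
        response = requests.get(self.status_url)
        response.raise_for_status()
        return response.json().get('state') == 'done'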

That way, you can use for example airflow.operators.HttpSensor, which polls an HTTP endpoint until a condition is met. Or, even better, write your own custom sensor like the sketch above, which gives you the opportunity to do more sophisticated processing and keep state.
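
For reference, here is a rough sketch of the built-in option. The import path is the Airflow 1.10 one; the connection id, endpoint, and response body are placeholders for your own service, and dag is assumed to be an existing DAG object.

from airflow.sensors.http_sensor import HttpSensor

# keeps poking tasks/1234/status until response_check returns True
wait_for_service = HttpSensor(
    task_id='wait_for_service',
    http_conn_id='my_service',
    endpoint='tasks/1234/status',
    response_check=lambda response: response.json()['state'] == 'done',
    poke_interval=30,   # seconds between polls
    timeout=60 * 60,    # fail the task after an hour of polling
    dag=dag,
)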

Otherwise, if the services write their output to a storage system, you can use a sensor that polls a database, for example. I believe you get the idea.
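
The built-in SqlSensor covers that case: it keeps poking until its query returns a truthy first cell. A sketch, again with an Airflow 1.10 import path and a made-up connection id and table:

from airflow.sensors.sql_sensor import SqlSensor

# succeeds once a 'done' row for the task shows up in the results database
wait_for_row = SqlSensor(
    task_id='wait_for_task_done',
    conn_id='my_results_db',
    sql="SELECT 1 FROM task_status WHERE task_key = 'task_1234' AND state = 'done'",
    poke_interval=60,
    dag=dag,
)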

I am attaching an example of a custom operator that I wrote for integrating with the Apache Livy API. The operator does two things: a) it submits a Spark job through the REST API, and b) it waits for the job to complete.

The operator extends SimpleHttpOperator while at the same time implementing the behaviour of an HttpSensor, thus combining both functionalities.

import json
from datetime import datetime
from time import sleep

from airflow.exceptions import AirflowException, AirflowSensorTimeout
from airflow.hooks.http_hook import HttpHook
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.utils.decorators import apply_defaults


class LivyBatchOperator(SimpleHttpOperator):
    """
    Submits a new Spark batch job through
    the Apache Livy REST API and polls it until completion.
    """

    template_fields = ('args',)
    ui_color = '#f4a460'

    @apply_defaults
    def __init__(self,
                 name,
                 className,
                 file,
                 executorMemory='1g',
                 driverMemory='512m',
                 driverCores=1,
                 executorCores=1,
                 numExecutors=1,
                 args=None,
                 conf=None,
                 timeout=120,
                 http_conn_id='apache_livy',
                 *arguments, **kwargs):
        """
        If xcom_push is True, the response of the HTTP request will also
        be pushed to an XCom.
        """
        super(LivyBatchOperator, self).__init__(
            endpoint='batches', *arguments, **kwargs)

        self.http_conn_id = http_conn_id
        self.method = 'POST'
        self.endpoint = 'batches'
        self.name = name
        self.className = className
        self.file = file
        self.executorMemory = executorMemory
        self.driverMemory = driverMemory
        self.driverCores = driverCores
        self.executorCores = executorCores
        self.numExecutors = numExecutors
        # None defaults above avoid shared mutable default arguments
        self.args = args or []
        self.conf = conf or {}
        self.timeout = timeout
        self.poke_interval = 10

    def execute(self, context):
        """
        Submits the batch and polls it until it finishes.
        """
        payload = {
            "name": self.name,
            "className": self.className,
            "executorMemory": self.executorMemory,
            "driverMemory": self.driverMemory,
            "driverCores": self.driverCores,
            "executorCores": self.executorCores,
            "numExecutors": self.numExecutors,
            "file": self.file,
            "args": self.args,
            "conf": self.conf
        }
        headers = {
            'X-Requested-By': 'airflow',
            'Content-Type': 'application/json'
        }

        http = HttpHook(self.method, http_conn_id=self.http_conn_id)

        self.log.info("Submitting batch through Apache Livy API")
        self.log.info("Payload: %s", payload)

        response = http.run(self.endpoint,
                            json.dumps(payload),
                            headers,
                            self.extra_options)

        # parse the JSON response
        obj = json.loads(response.content)

        # get the new batch id
        self.batch_id = obj['id']

        self.log.info('Batch successfully submitted with id %s', self.batch_id)

        # start polling the batch status
        started_at = datetime.utcnow()
        while not self.poke(context):
            if (datetime.utcnow() - started_at).total_seconds() > self.timeout:
                raise AirflowSensorTimeout('Snap. Time is OUT.')

            sleep(self.poke_interval)

        self.log.info("Batch %s has finished", self.batch_id)

    def poke(self, context):
        """
        Checks the current state of the submitted batch.
        """
        http = HttpHook(method='GET', http_conn_id=self.http_conn_id)

        self.log.info("Calling Apache Livy API to get batch status")

        # call the API endpoint for this batch
        endpoint = 'batches/' + str(self.batch_id)
        response = http.run(endpoint)

        # parse the JSON response
        obj = json.loads(response.content)

        # get the current state of the batch
        state = obj['state']

        if state in ('starting', 'running'):
            # still in progress: signal a new polling cycle
            self.log.info('Batch %s has not finished yet (%s)',
                          self.batch_id, state)
            return False
        elif state == 'success':
            return True
        else:
            # any other state means failure: raise and terminate the task
            raise AirflowException(
                'Batch ' + str(self.batch_id) + ' failed (' + state + ')')
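
For completeness, wiring the operator into a DAG could look roughly like this. The job name, main class, and jar path are placeholders, and 'apache_livy' is assumed to be an HTTP connection pointing at the Livy server:

submit_batch = LivyBatchOperator(
    task_id='submit_spark_batch',
    name='nightly_aggregation',
    className='com.example.spark.NightlyAggregation',  # hypothetical main class
    file='hdfs:///jobs/nightly-aggregation.jar',       # hypothetical jar location
    args=['--date', '{{ ds }}'],
    timeout=1800,  # allow up to 30 minutes of polling
    http_conn_id='apache_livy',
    dag=dag,
)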

Hope this helps.