Airflow Jinja template error when parsing a K8SOperator in Python

Date: 2019-09-25 15:25:02

Tags: python google-cloud-platform composer-php jinja2 airflow

I have been struggling with a Jinja templating error in composer-1.5.2-airflow-1.10.0.

After the daily file transfer from FTP, I need a K8SOperator in my DAG to launch a daily Dataflow job from a template published through CI/CD, and once the job has started I need to check its status through the REST API.

In the DAG, I rely on a Python Docker image to call the GCP REST API through the Google Python client.

Here is a sample of my yaml descriptor:

yaml_dataflow_create_job_from_template= """
containers:
  - name: load-data
    image: registry.xxx.yyyy.com/docker-images/python-gcloud:latest
    resources: 
      requests: 
        memory: "256Mi"
        cpu: "0.5"
      limits:
        memory: "2Gi"
        cpu: "2"
    args:
      - python
      - "-c"
      - |
        import os
        import io
        import json
        import google.cloud.storage.client as storage
        from googleapiclient.discovery import build

        service = build('dataflow', 'v1b3')

        GCSPATH="gs://{{params.bucket}}/templates/{{params.template_name}}"

        BODY = (
        '{'
        '    "jobName": "{{params.date}}-dataflow-job",'
        '    "parameters": {'
        '        "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.file.fr.log.gz",'
        '        "outputLogsByDayTable": "{{params.dataByDayTable}}",'
        '        "outputLogsByWeekTable": "{{params.dataByWeekTable}}",'
        '        "outputLogsByMonthTable": "{{params.dataByMonthTable}}",'
        '        "outputRawLogsTable": "{{params.dataRawTable}}"'
        '    },'
        '    "environment": {'
        '        "serviceAccountEmail": "{{params.sac}}",'
        '        "tempLocation": "{{params.templocation}}",'
        '        "zone": "{{params.zone}}",'
        '        "network": "shared",'
        '        "subnetwork": "{{params.network}}"'
        '    }'
        '}'
        )

        request = service.projects().locations().templates().launch(projectId="{{params.project_id}}", location="{{params.region}}", gcsPath=GCSPATH, body=json.loads(BODY))
        response = request.execute()

        print("send : ")
        print("response : "+ json.dumps(response))

        job_id = ""
        try:
           status = response['job']['currentState']
        except KeyError: 
           job_id = response['job']['id']

        import json
        import time

        print("Start :" + time.ctime())
        count = 0
        status = 'unknown'

        statusRequest = service.projects().locations().jobs().get(projectId="{{params.project_id}}", location="{{params.region}}", jobId=job_id)
        while (status == "unknown" and count <=200):
           time.sleep( 5 )
           statusResponse = statusRequest.execute()
           status = statusResponse['job']['currentState']
           print("Request#"+str(count)+" job("+job_id+") status :"+ status)
           count = count + 1
           try:
              status = response['job']['currentState']
           except KeyError: 
              status = 'unknown'
        print("Request#"+str(count)+" job("+job_id+") status :"+ status)
        print("End : " + time.ctime())

    volumeMounts:
      - name: google-cloud-key
        mountPath: /var/secrets/google
        readOnly: true
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/credentials-keyfile.json

imagePullSecrets:
  - name: gitlab-key
volumes:
  - name: google-cloud-key
    secret:
      secretName: credentials-keyfile
"""

And here is my K8sOperator:

dataflow_job = K8SJobOperator(
    task_id="dataflow_daily_job",
    location=location,
    project_id=host_project,
    cluster_name=host_cluster,
    name="dataflow_daily_job",
    gcp_conn_id='gcp_kub_runner',
    params={
        "yesterday_date": yesterdayds,
        "date": executiondate,
        "bucket": SEOLOG_SOURCING_BUCKET,
        "template_name": "DataflowTmpl",
        "sac": 'service-account@project.iam.gserviceaccount.com',
        "job_name": '{executiondate}-Dataflow-job'.format(executiondate=executiondate),
        "inputLogFile": 'gs://project-bucket/input/{yesterdayds}.file.fr.log.gz'.format(yesterdayds=yesterdayds),
        "outputLogsByDayTable": "project:dataset.DATA_BYDAY",
        "outputLogsByWeekTable": "project:dataset.DATA_BYWEEK",
        "outputLogsByMonthTable": "project:dataset.DATA_BYMONTH",
        "outputRawLogsTable": "project:dataset.RAW_DATA",
        "templocation": "gs://project-bucket/input/temp/",
        "region": Variable.get('REGION'),
        "zone": Variable.get('LOCATION'),
        "project_id": Variable.get('DP_PROJECT'),
        "network": "https://www.googleapis.com/compute/v1/{subnet}".format(subnet=Variable.get('SUBNET')),
    },
    namespace='composer',
    descriptor=yaml_dataflow_create_job_from_template,
    timeout_s=60 * 15,
    dag=dag)

In the logs I only noticed [2019-09-18 14:23:23,282] {logging_mixin.py:95} INFO - Error running jinja on yam, with no further explanation.

The problem prevents any Python statement that uses the [, ], or : characters from being interpreted correctly. No while or for loop and no if/elif/else statement is recognized, and it is impossible to read any value or array out of the JSON response. Because of the Jinja error, the Airflow macros {{ ds }}, {{ ds_nodash }}, {{ yesterday_ds }} and {{ yesterday_ds_nodash }} are broken as well.
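One way to narrow this down is to replay the two steps outside Airflow. This is only a rough sketch that assumes the operator renders the descriptor with Jinja once and then parses the result as YAML (I don't know the internals of K8SJobOperator); the params values below are placeholders:

    import jinja2
    import yaml

    # Placeholder values standing in for the real DAG params; keys that are
    # missing simply render as empty strings with Jinja's default Undefined.
    test_params = {
        "date": "20190925",
        "yesterday_date": "20190924",
        "bucket": "project-bucket",
        "template_name": "DataflowTmpl",
    }

    # Step 1: render the {{ params.* }} expressions, as Airflow does for
    # templated fields.
    rendered = jinja2.Template(yaml_dataflow_create_job_from_template).render(params=test_params)

    # Step 2: parse the rendered text as YAML; the embedded Python code should
    # come back untouched as the third "args" entry of the container spec.
    spec = yaml.safe_load(rendered)
    print(spec["containers"][0]["args"][2][:300])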

After a few days, I finally noticed that the problem was solved by changing the BODY JSON Python variable from this:

        BODY = (
        '{'
        '    "jobName": "{{params.date}}-dataflow-job",'
        '    "parameters": {'
        '        "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.file.fr.log.gz",'
        '        "outputLogsByDayTable": "{{params.outputLogsByDayTable}}",'
        '        "outputLogsByWeekTable": "{{params.outputLogsByWeekTable}}",'
        '        "outputLogsByMonthTable": "{{params.outputLogsByMonthTable}}",'
        '        "outputRawLogsTable": "{{params.outputRawLogsTable}}"'
        '    },'
        '    "environment": {'
        '        "serviceAccountEmail": "{{params.sac}}",'
        '        "tempLocation": "{{params.templocation}}",'
        '        "zone": "{{params.zone}}",'
        '        "network": "shared",'
        '        "subnetwork": "{{params.network}}"'
        '    }'
        '}'
        )

to this:

        BODY= '{'  \
        '    "jobName": "{{ params.date }}-dataflow-job",' \
        '    "parameters": {' \
        '        "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.input.fr.log.gz",' \
        '        "outputLogsByDayTable": "{{ params.outputLogsByDayTable }}",' \
        '        "outputLogsByWeekTable": "{{ params.outputLogsByWeekTable }}",' \
        '        "outputLogsByMonthTable": "{{ params.outputLogsByMonthTable }}",' \
        '        "outputRawLogsTable": "{{ params.outputRawLogsTable }}"' \
        '    },' \
        '    "environment": {' \
        '        "serviceAccountEmail": "{{ params.sac }}",' \
        '        "tempLocation": "{{ params.templocation }}",' \
        '        "zone": "{{ params.zone }}",' \
        '        "network": "shared",' \
        '        "subnetwork": "{{ params.network }}"' \
        '    }' \
        '}'

even though the two forms are syntactically equivalent (a quick check follows below). PS: don't mind the fact that I use params where it is not strictly necessary... I just declare them once at the beginning of the DAG:

yesterdayds = '{{ yesterday_ds_nodash }}'
executiondate = '{{ ds_nodash }}'
date = '{{ ds }}'
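As a quick check of the "syntactically equivalent" remark above: outside the yaml descriptor, the parenthesized form and the backslash-continued form build the exact same Python string (shortened placeholder content here):

    # Implicit string concatenation gives the same result whether the pieces
    # are grouped with parentheses or joined with line continuations.
    body_parens = (
        '{'
        '  "jobName": "{{ params.date }}-dataflow-job"'
        '}'
    )

    body_backslash = '{' \
        '  "jobName": "{{ params.date }}-dataflow-job"' \
        '}'

    assert body_parens == body_backslash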

0 Answers:

There are no answers yet.