I have been struggling with a Jinja templating error in composer-1.5.2-airflow-1.10.0.
After a daily file transfer from FTP, I need a K8SOperator in my DAG to launch a daily Dataflow job from a template published through CI/CD, and once the job has started I need to check its status through the REST API.
In the DAG I rely on a Python Docker image to call the GCP REST API through the Google Python client.
Here is a sample of my yaml descriptor:
yaml_dataflow_create_job_from_template= """
containers:
- name: load-data
image: registry.xxx.yyyy.com/docker-images/python-gcloud:latest
resources:
requests:
memory: "256Mi"
cpu: "0.5"
limits:
memory: "2Gi"
cpu: "2"
args:
- python
- "-c"
- |
import os
import io
import json
import google.cloud.storage.client as storage
from googleapiclient.discovery import build
service = build('dataflow', 'v1b3')
GCSPATH="gs://{{params.bucket}}/templates/{{params.template_name}}"
BODY = (
'{'
' "jobName": "{{params.date}}-dataflow-job",'
' "parameters": {'
' "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.file.fr.log.gz",'
' "outputLogsByDayTable": "{{params.dataByDayTable}}",'
' "outputLogsByWeekTable": "{{params.dataByWeekTable}}",'
' "outputLogsByMonthTable": "{{params.dataByMonthTable}}",'
' "outputRawLogsTable": "{{params.dataRawTable}}"'
' },'
' "environment": {'
' "serviceAccountEmail": "{{params.sac}}",'
' "tempLocation": "{{params.templocation}}",'
' "zone": "{{params.zone}}",'
' "network": "shared",'
' "subnetwork": "{{params.network}}"'
' }'
'}'
)
request = service.projects().locations().templates().launch(projectId="{{params.project_id}}", location="{{params.region}}", gcsPath=GCSPATH, body=json.loads(BODY))
response = request.execute()
print("send : ")
print("response : "+ json.dumps(response))
job_id = ""
try:
status = response['job']['currentState']
except KeyError:
job_id = response['job']['id']
import json
import time
print("Start :" + time.ctime())
count = 0
status = 'unknown'
statusRequest = service.projects().locations().jobs().get(projectId="{{params.project_id}}", location="{{params.region}}", jobId=job_id)
while (status == "unknown" and count <=200):
time.sleep( 5 )
statusResponse = statusRequest.execute()
status = statusResponse['job']['currentState']
print("Request#"+str(count)+" job("+job_id+") status :"+ status)
count = count + 1
try:
status = response['job']['currentState']
except KeyError:
status = 'unknown'
print("Request#"+str(count)+" job("+job_id+") status :"+ status)
print("End : " + time.ctime())
volumeMounts:
- name: google-cloud-key
mountPath: /var/secrets/google
readOnly: true
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /var/secrets/google/credentials-keyfile.json
imagePullSecrets:
- name: gitlab-key
volumes:
- name: google-cloud-key
secret:
secretName: credentials-keyfile
"""
And here is my K8SOperator:
dataflow_job = K8SJobOperator(
    task_id="dataflow_daily_job",
    location=location,
    project_id=host_project,
    cluster_name=host_cluster,
    name="dataflow_daily_job",
    gcp_conn_id='gcp_kub_runner',
    params={
        "yesterday_date": yesterdayds,
        "date": executiondate,
        "bucket": SEOLOG_SOURCING_BUCKET,
        "template_name": "DataflowTmpl",
        "sac": 'service-account@project.iam.gserviceaccount.com',
        "job_name": '{executiondate}-Dataflow-job'.format(executiondate=executiondate),
        "inputLogFile": 'gs://project-bucket/input/{yesterdayds}.file.fr.log.gz'.format(yesterdayds=yesterdayds),
        "outputLogsByDayTable": "project:dataset.DATA_BYDAY",
        "outputLogsByWeekTable": "project:dataset.DATA_BYWEEK",
        "outputLogsByMonthTable": "project:dataset.DATA_BYMONTH",
        "outputRawLogsTable": "project:dataset.RAW_DATA",
        "templocation": "gs://project-bucket/input/temp/",
        "region": Variable.get('REGION'),
        "zone": Variable.get('LOCATION'),
        "project_id": Variable.get('DP_PROJECT'),
        "network": "https://www.googleapis.com/compute/v1/{subnet}".format(subnet=Variable.get('SUBNET')),
    },
    namespace='composer',
    descriptor=yaml_dataflow_create_job_from_template,
    timeout_s=60 * 15,
    dag=dag)
In the task logs I noticed [2019-09-18 14:23:23,282] {logging_mixin.py:95} INFO - Error running jinja on yam, with no further explanation.
The problem meant that any Python statement using the [, ] or : characters was not interpreted correctly: no while or for loop and no if, else or elif statement was recognized, and it was impossible to pull any value or array out of the JSON response.
Because of the Jinja error, the Airflow macros {{ ds }}, {{ ds_nodash }}, {{ yesterday_ds }} and {{ yesterday_ds_nodash }} were broken as well.
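One way to narrow this kind of failure down (a debugging sketch of mine, not part of the original setup) is to render the descriptor by hand with plain jinja2 and dummy params, outside Airflow, so the template engine reports the actual error instead of the bare INFO line from the log:

import jinja2

# dummy values for every param the descriptor references; Airflow would also
# inject macros such as ds and ds_nodash into the context, but the descriptor
# above only reads them through params
dummy_params = {
    "date": "20190918", "yesterday_date": "20190917",
    "bucket": "project-bucket", "template_name": "DataflowTmpl",
    "outputLogsByDayTable": "project:dataset.DATA_BYDAY",
    "outputLogsByWeekTable": "project:dataset.DATA_BYWEEK",
    "outputLogsByMonthTable": "project:dataset.DATA_BYMONTH",
    "outputRawLogsTable": "project:dataset.RAW_DATA",
    "sac": "service-account@project.iam.gserviceaccount.com",
    "templocation": "gs://project-bucket/input/temp/",
    "zone": "europe-west1-b", "region": "europe-west1",
    "project_id": "my-project", "network": "my-subnet",
}

rendered = jinja2.Template(yaml_dataflow_create_job_from_template).render(params=dummy_params)
print(rendered)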
After a few days I finally noticed that the problem was solved by replacing this JSON Python variable:
BODY = (
    '{'
    ' "jobName": "{{params.date}}-dataflow-job",'
    ' "parameters": {'
    ' "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.file.fr.log.gz",'
    ' "outputLogsByDayTable": "{{params.outputLogsByDayTable}}",'
    ' "outputLogsByWeekTable": "{{params.outputLogsByWeekTable}}",'
    ' "outputLogsByMonthTable": "{{params.outputLogsByMonthTable}}",'
    ' "outputRawLogsTable": "{{params.outputRawLogsTable}}"'
    ' },'
    ' "environment": {'
    ' "serviceAccountEmail": "{{params.sac}}",'
    ' "tempLocation": "{{params.templocation}}",'
    ' "zone": "{{params.zone}}",'
    ' "network": "shared",'
    ' "subnetwork": "{{params.network}}"'
    ' }'
    '}'
)
with this:
BODY = '{' \
       ' "jobName": "{{ params.date }}-dataflow-job",' \
       ' "parameters": {' \
       ' "inputLogFile" : "gs://project-bucket/input/{{ params.yesterday_date }}.input.fr.log.gz",' \
       ' "outputLogsByDayTable": "{{ params.outputLogsByDayTable }}",' \
       ' "outputLogsByWeekTable": "{{ params.outputLogsByWeekTable }}",' \
       ' "outputLogsByMonthTable": "{{ params.outputLogsByMonthTable }}",' \
       ' "outputRawLogsTable": "{{ params.outputRawLogsTable }}"' \
       ' },' \
       ' "environment": {' \
       ' "serviceAccountEmail": "{{ params.sac }}",' \
       ' "tempLocation": "{{ params.templocation }}",' \
       ' "zone": "{{ params.zone }}",' \
       ' "network": "shared",' \
       ' "subnetwork": "{{ params.network }}"' \
       ' }' \
       '}'
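As a quick sanity check (my own snippet, not part of the original post), the two spellings really do build the exact same Python string; the only difference is how the multi-line statement is written:

# illustrative only: implicit string concatenation inside parentheses
body_parens = (
    '{'
    ' "jobName": "test-job"'
    '}'
)
# backslash line continuation, character for character the same result
body_backslash = '{' \
                 ' "jobName": "test-job"' \
                 '}'
assert body_parens == body_backslash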
Yet only the second form makes it through the templating step, even though the two are syntactically equivalent in Python. PS: pay no attention to the fact that I use params where it is not strictly necessary... I simply declare them once at the beginning of the DAG:
yesterdayds = '{{ yesterday_ds_nodash }}'
executiondate = '{{ ds_nodash }}'
date = '{{ ds }}'
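For reference (my addition, not from the original post), these are standard Airflow macros; for an execution date of 2019-09-18 they render as:

# {{ ds }}                  -> '2019-09-18'
# {{ ds_nodash }}           -> '20190918'
# {{ yesterday_ds_nodash }} -> '20190917'
# so, assuming the operator resolves them when rendering the descriptor (as the
# post implies), the jobName for that run would be "20190918-dataflow-job" and
# the input file gs://project-bucket/input/20190917.file.fr.log.gz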