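I have a DAG that fetches data from Elasticsearch and ingests it into the data lake. The first task, BeginIngestion, fans out into several tasks (one per resource), and each of those fans out into more tasks (one per shard). After the shards are fetched, the data is uploaded to S3, and then everything converges into the task EndIngestion, which is followed by the task AuditIngestion.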
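It used to execute correctly, but now all the tasks run successfully and the closing task EndIngestion is left with no status. When I refresh the webserver's page, the DAG is marked as failed.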
This image shows the successful upstream tasks, with the task end_ingestion having no status and the DAG marked as failed.
I also dug into the task instance details and found:
- Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
- Trigger Rule: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 49, 'successes': 49}, upstream_task_ids=['s3_finish_upload_ingestion_raichucrud_complain', 's3_finish_upload_ingestion_raichucrud_interaction', 's3_finish_upload_ingestion_raichucrud_company', 's3_finish_upload_ingestion_raichucrud_user', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_location', 's3_finish_upload_ingestion_raichucrud_companytoken', 's3_finish_upload_ingestion_raichucrud_indexevolution', 's3_finish_upload_ingestion_raichucrud_companyindex', 's3_finish_upload_ingestion_raichucrud_producttype', 's3_finish_upload_ingestion_raichucrud_categorycomplainsto', 's3_finish_upload_ingestion_raichucrud_companyresponsible', 's3_finish_upload_ingestion_raichucrud_category', 's3_finish_upload_ingestion_raichucrud_additionalfieldoption', 's3_finish_upload_ingestion_raichucrud_privatecontactconfiguration', 's3_finish_upload_ingestion_raichucrud_phone', 's3_finish_upload_ingestion_raichucrud_presence', 's3_finish_upload_ingestion_raichucrud_responsible', 's3_finish_upload_ingestion_raichucrud_store', 's3_finish_upload_ingestion_raichucrud_socialprofile', 's3_finish_upload_ingestion_raichucrud_product', 's3_finish_upload_ingestion_raichucrud_macrorankingpresenceto', 's3_finish_upload_ingestion_raichucrud_macroinfoto', 's3_finish_upload_ingestion_raichucrud_raphoneproblem', 's3_finish_upload_ingestion_raichucrud_macrocomplainsto', 's3_finish_upload_ingestion_raichucrud_testimony', 's3_finish_upload_ingestion_raichucrud_additionalfield', 's3_finish_upload_ingestion_raichucrud_companypageblockitem', 's3_finish_upload_ingestion_raichucrud_rachatconfiguration', 's3_finish_upload_ingestion_raichucrud_macrorankingitemto', 's3_finish_upload_ingestion_raichucrud_purchaseproduct', 's3_finish_upload_ingestion_raichucrud_rachatproblem', 's3_finish_upload_ingestion_raichucrud_role', 's3_finish_upload_ingestion_raichucrud_requestmoderation', 's3_finish_upload_ingestion_raichucrud_categoryproblemto', 's3_finish_upload_ingestion_raichucrud_companypageblock', 's3_finish_upload_ingestion_raichucrud_problemtype', 's3_finish_upload_ingestion_raichucrud_key', 's3_finish_upload_ingestion_raichucrud_macro', 's3_finish_upload_ingestion_raichucrud_url', 's3_finish_upload_ingestion_raichucrud_document', 's3_finish_upload_ingestion_raichucrud_transactionkey', 's3_finish_upload_ingestion_raichucrud_catprobitemcompany', 's3_finish_upload_ingestion_raichucrud_privatecontactinteraction', 's3_finish_upload_ingestion_raichucrud_categoryinfoto', 's3_finish_upload_ingestion_raichucrud_marketplace', 's3_finish_upload_ingestion_raichucrud_macroproblemto', 's3_finish_upload_ingestion_raichucrud_categoryrankingto', 's3_finish_upload_ingestion_raichucrud_macrorankingto', 's3_finish_upload_ingestion_raichucrud_categorypageto']
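As you can see, the "Trigger Rule" field says that one of the upstream tasks is in a non-success state, but at the same time the stats show that all upstream tasks are marked as successful.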
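One way to cross-check this mismatch is to compare the pasted upstream_task_ids list against the reported success count. The sketch below is only a diagnostic aid: the helper name check_upstream and the sample ids are hypothetical, and it assumes (from the wording of the message, not from a confirmed reading of the Airflow source) that the non-success count is the number of listed upstream ids minus the number of successes.

```python
from collections import Counter


def check_upstream(upstream_task_ids, successes):
    """Summarize an upstream id list against a reported success count.

    Assumption: non-successes = number of listed ids - successes.
    """
    # Strip stray whitespace that can creep in when copying ids from the UI.
    ids = [task_id.strip() for task_id in upstream_task_ids]
    counts = Counter(ids)
    # Ids that appear more than once in the list.
    duplicates = sorted(task_id for task_id, n in counts.items() if n > 1)
    return {
        "listed": len(ids),
        "unique": len(counts),
        "non_success": len(ids) - successes,
        "duplicates": duplicates,
    }


# Hypothetical ids for illustration; replace with the real list and count.
report = check_upstream(["task_a", "task_b", " task_b "], successes=2)
print(report)
```

If "listed" and "unique" differ, the same task id was registered upstream more than once, which under the assumption above would be enough to produce a non-success count even when every task succeeded.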
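If I reset the database, it doesn't happen, but I can't reset it for every execution (hourly), and I don't want to reset it either.
Can anyone shed some light on this?
PS: I am running on an EC2 instance (c4.xlarge) using LocalExecutor.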
[EDIT]
I found in the scheduler log that the DAG is deadlocked:
[2017-08-25 19:25:25,821] {models.py:4076} DagFileProcessor157 INFO - Deadlock; marking run failed
I think this might be due to some exception handling.