How to integrate Cloud Composer with Compute Engine

Time: 2020-07-28 19:47:04

Tags: google-cloud-platform google-cloud-composer

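Hi everyone, I'm new to GCP (moving over from AWS) and I have a lame question (please excuse me). We are building a traditional EDW on GCP. We use Cloud Composer as the scheduler, and all of our code lives on Compute Engine (the equivalent of an EC2 instance in AWS).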

How do I set up a workflow that runs my jobs on Compute Engine? Or what is the best way to achieve the same result?

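More information about our pipelines: Pipeline 1: extract millions of rows from a (legacy) SQL database, apply some ETL logic [cleaning, adding new columns, dropping columns, changing the case of column values, etc.], and finally load into Redshift.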

Pipeline 2: read data from Google Sheets, apply the same ETL logic, and load into a different Redshift table.

Pipeline 3: read data from a Google API, clean it, insert it into Redshift, and so on.

What is the best way to write these ETL workflows with Cloud Composer?

Any help is greatly appreciated!

----------PROJECT STRUCTURE & REQUIREMENTS------------
On my Compute Engine instance I have projects laid out like this:

    /home/ubunutu/projects/project1
        /venv
        /src/job1.py (reads Google Sheets and loads into Cloud SQL)
        /src/job2.py (reads the Google AdWords API, cleans the data, modifies attributes, and loads into Cloud SQL)
    
    
    /home/ubunutu/projects/project2
        /venv
        /src/job1.py (reads a file from GCS, cleans it, adds/removes columns, and loads into Cloud SQL)
        /src/job2.py (reads data from Cloud SQL table A, applies some modifications, and loads into Cloud SQL table B)
    
    
    
    
    Now, in Composer, how do I orchestrate the complete workflow? The Python jobs sit on Compute Engine and I need to execute them there.
    
    The reason we use Compute Engine is to perform in-memory operations such as reading data into a dataframe, doing group-bys, creating new columns, creating temporary files, and so on.
    
    Or what would you suggest instead?
    For example, moving the whole sandbox to Composer's /data directory, like this:
    /data/projects/project1
        /venv
        /src/job1.py (reads Google Sheets and loads into Cloud SQL)
        /src/job2.py (reads the Google AdWords API, cleans the data, modifies attributes, and loads into Cloud SQL)
    
    
    /data/projects/project2
        /venv
        /src/job1.py (reads a file from GCS, cleans it, adds/removes columns, and loads into Cloud SQL)
        /src/job2.py (reads data from Cloud SQL table A, applies some modifications, and loads into Cloud SQL table B)
    
    
    In this case:
        1. Will I be able to download temporary files onto the Composer server and perform operations on them?
        2. Will I no longer need to create a venv if I place my code in Composer directly, since I can install PyPI packages through the console?

----------------------------------------------------------

Could you share your knowledge on this? Thanks a lot in advance!

Thank you very much!

1 answer:

Answer 0: (score: 1)

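Here is a design pattern that you can adapt to your needs: Task scheduling on Compute Engine with Cloud Scheduler

Assuming you can set up Pub/Sub topics and subscriptions, you can...

  • Have a DAG in Composer that runs some code and publishes a message to a Pub/Sub topic.
  • Run a process on Compute Engine that subscribes to that topic. When a message arrives, it triggers the script you need to run.
  • When the script finishes, publish a notification to a Pub/Sub topic.
  • Have a separate DAG in Composer that is triggered when that message is received (note: there are several ways to do this; see here).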
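
The Compute Engine side of this pattern could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: it assumes the `google-cloud-pubsub` client library is installed on the VM, and the JSON message format (`job`, `run_id`), the `JOBS` registry, the `venv/bin/python` interpreter path, and the function names `handle_trigger` and `listen` are all hypothetical choices made for the example.

```python
import json
import subprocess

# Hypothetical job registry: "job" name in the message -> command to run on the VM.
# Each project keeps using the interpreter from its own venv (path is an assumption).
JOBS = {
    "project1.job1": [
        "/home/ubunutu/projects/project1/venv/bin/python",
        "/home/ubunutu/projects/project1/src/job1.py",
    ],
}


def handle_trigger(payload, jobs=JOBS, runner=subprocess.run):
    """Run the job named in a JSON trigger message; return a status message as bytes."""
    request = json.loads(payload.decode("utf-8"))
    job = request["job"]
    if job in jobs:
        returncode = runner(jobs[job]).returncode
    else:
        returncode = -1  # unknown job: report a failure instead of crashing the listener
    status = {"job": job, "run_id": request.get("run_id"), "returncode": returncode}
    return json.dumps(status).encode("utf-8")


def listen(project_id, trigger_subscription, done_topic):
    """Block and run one job per message received on the trigger subscription."""
    # Imported here so handle_trigger stays usable without the client library.
    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    publisher = pubsub_v1.PublisherClient()
    subscription_path = subscriber.subscription_path(project_id, trigger_subscription)
    topic_path = publisher.topic_path(project_id, done_topic)

    def callback(message):
        status = handle_trigger(message.data)
        publisher.publish(topic_path, status).result()  # wait until the status is sent
        message.ack()  # acknowledge only after the job has run and been reported

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    with subscriber:
        streaming_pull.result()  # blocks until the stream fails or is cancelled
```

On the VM, a long-running service would call `listen(...)` with your own project, subscription, and topic names. On the Composer side, the Airflow Google provider's Pub/Sub operator and sensor are one possible way to publish the trigger message and wait for the completion message, assuming that provider is available in your environment.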