Start CloudSQL proxy on Python Dataflow / Apache Beam

Date: 2018-06-05 13:46:48

Tags: python google-cloud-dataflow google-cloud-sql apache-beam cloud-sql-proxy

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from AppEngine using a Cron job.

I have a version which works locally using the DirectRunner. For that I use the CloudSQL (Postgres) proxy client so that I can connect to the database on 127.0.0.1.
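
For illustration, a minimal sketch of that local connection (the proxy is assumed to be already running on 127.0.0.1:5432; database name and credentials are placeholders):

import psycopg2

# Assumes the Cloud SQL proxy is already running locally, e.g. started with
#   ./cloud_sql_proxy -instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:5432
# Database name, user and password are placeholders.
connection = psycopg2.connect(
    host='127.0.0.1',
    port=5432,
    dbname='<DB_NAME>',
    user='<DB_USER>',
    password='<DB_PASSWORD>')

cursor = connection.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchone())
cursor.close()
connection.close()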

When using the DataflowRunner with custom commands that start the proxy within a setup.py script, the job won't execute. It keeps repeating this log message:

Setting node annotation to enable volume controller attach/detach

Part of my setup.py looks like this:

import logging
import subprocess

import setuptools

CUSTOM_COMMANDS = [
['echo', 'Custom command worked!'],
['wget', 'https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64', '-O', 'cloud_sql_proxy'],
['echo', 'Proxy downloaded'],
['chmod', '+x', 'cloud_sql_proxy']]

class CustomCommands(setuptools.Command):
  """A setuptools Command class able to run arbitrary commands."""

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def RunCustomCommand(self, command_list):
    print('Running command: %s' % command_list)
    logging.info("Running custom commands")
    p = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Can use communicate(input='y\n'.encode()) if the command run requires
    # some confirmation.
    stdout_data, _ = p.communicate()
    print('Command output: %s' % stdout_data)
    if p.returncode != 0:
      raise RuntimeError(
          'Command %s failed: exit code: %s' % (command_list, p.returncode))

  def run(self):
    for command in CUSTOM_COMMANDS:
      self.RunCustomCommand(command)
    subprocess.Popen(['./cloud_sql_proxy', '-instances=bi-test-1:europe-west1:test-animal=tcp:5432'])

I added the last line as a separate subprocess.Popen() within run() after reading an issue from sthomp on GitHub and this discussion on Stack Overflow. I also tried playing around with some of the parameters of subprocess.Popen.
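
For context, here is a sketch of how such a setup.py is typically handed to the Dataflow runner via the --setup_file pipeline option (the project and bucket values below are placeholders, not my actual configuration):

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch with placeholder project/bucket values; --setup_file is what ships
# setup.py (and therefore the custom commands above) to the Dataflow workers.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=<PROJECT_ID>',
    '--region=europe-west1',
    '--temp_location=gs://<BUCKET>/temp',
    '--staging_location=gs://<BUCKET>/staging',
    '--setup_file=./setup.py',
])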

Another solution, mentioned by brodin, is to allow access from every IP address and to connect via username and password. From my understanding, he does not consider this best practice, though.

Thanks in advance for your help.

!!! Workaround solution at the bottom of this post !!!

Update - log files

These are the error-level logs which occur during the job:

E  EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities 
E  Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container / 
E  Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container / 
E  Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container / 
E  [ContainerManager]: Fail to get rootfs information unable to find data for container / 
E  Could not find capacity information for resource storage.kubernetes.io/scratch 
E  debconf: delaying package configuration, since apt-utils is not installed 
E    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
E                                   Dload  Upload   Total   Spent    Left  Speed 
E  
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3698  100  3698    0     0  25674      0 --:--:-- --:--:-- --:--:-- 25860 



#-- HERE IS WHEN setup.py FOR MY JOB IS EXECUTED ---

E  debconf: delaying package configuration, since apt-utils is not installed 
E  insserv: warning: current start runlevel(s) (empty) of script `stackdriver-extractor' overrides LSB defaults (2 3 4 5). 
E  insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `stackdriver-extractor' overrides LSB defaults (0 1 6). 
E  option = Interval; value = 60.000000; 
E  option = FQDNLookup; value = false; 
E  Created new plugin context. 
E  option = PIDFile; value = /var/run/stackdriver-agent.pid; 
E  option = Interval; value = 60.000000; 
E  option = FQDNLookup; value = false; 
E  Created new plugin context. 

Here you can find all logs after the start of my custom setup.py (log level: any; all logs):

this

Update log files 2

Job logs (I canceled the job manually after it had been stuck for a while):

 2018-06-08 (08:02:20) Autoscaling is enabled for job 2018-06-07_23_02_20-5917188751755240698. The number of workers will b...
 2018-06-08 (08:02:20) Autoscaling was automatically enabled for job 2018-06-07_23_02_20-5917188751755240698.
 2018-06-08 (08:02:24) Checking required Cloud APIs are enabled.
 2018-06-08 (08:02:24) Checking permissions granted to controller Service Account.
 2018-06-08 (08:02:25) Worker configuration: n1-standard-1 in europe-west1-b.
 2018-06-08 (08:02:25) Expanding CoGroupByKey operations into optimizable parts.
 2018-06-08 (08:02:25) Combiner lifting skipped for step Save new watermarks/Write/WriteImpl/GroupByKey: GroupByKey not fol...
 2018-06-08 (08:02:25) Combiner lifting skipped for step Group watermarks: GroupByKey not followed by a combiner.
 2018-06-08 (08:02:25) Expanding GroupByKey operations into optimizable parts.
 2018-06-08 (08:02:26) Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
 2018-06-08 (08:02:26) Annotating graph with Autotuner information.
 2018-06-08 (08:02:26) Fusing adjacent ParDo, Read, Write, and Flatten operations
 2018-06-08 (08:02:26) Fusing consumer Get rows from CloudSQL tables into Begin pipeline with watermarks/Read
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/Write into Group watermarks/Reify
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/GroupByWindow into Group watermarks/Read
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WriteBundles/WriteBundles into Save new watermar...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/GroupByWindow into Save new watermark...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Reify into Save new watermarks/Write/...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Write into Save new watermarks/Write/...
 2018-06-08 (08:02:26) Fusing consumer Write to BQ into Get rows from CloudSQL tables
 2018-06-08 (08:02:26) Fusing consumer Group watermarks/Reify into Write to BQ
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/Map(<lambda at iobase.py:926>) into Convert dict...
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WindowInto(WindowIntoFn) into Save new watermark...
 2018-06-08 (08:02:26) Fusing consumer Convert dictionary list to single dictionary and json into Remove "watermark" label
 2018-06-08 (08:02:26) Fusing consumer Remove "watermark" label into Group watermarks/GroupByWindow
 2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/InitializeWrite into Save new watermarks/Write/W...
 2018-06-08 (08:02:26) Workflow config is missing a default resource spec.
 2018-06-08 (08:02:26) Adding StepResource setup and teardown to workflow graph.
 2018-06-08 (08:02:26) Adding workflow start and stop steps.
 2018-06-08 (08:02:26) Assigning stage ids.
 2018-06-08 (08:02:26) Executing wait step start25
 2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/DoOnce/Read+Save new watermarks/Write/WriteI...
 2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/GroupByKey/Create
 2018-06-08 (08:02:26) Starting worker pool setup.
 2018-06-08 (08:02:26) Executing operation Group watermarks/Create
 2018-06-08 (08:02:26) Starting 1 workers in europe-west1-b...
 2018-06-08 (08:02:27) Value "Group watermarks/Session" materialized.
 2018-06-08 (08:02:27) Value "Save new watermarks/Write/WriteImpl/GroupByKey/Session" materialized.
 2018-06-08 (08:02:27) Executing operation Begin pipeline with watermarks/Read+Get rows from CloudSQL tables+Write to BQ+Gr...
 2018-06-08 (08:02:36) Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently runnin...
 2018-06-08 (08:02:46) Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently runnin...
 2018-06-08 (08:03:05) Workers have started successfully.
 2018-06-08 (08:11:37) Cancel request is committed for workflow job: 2018-06-07_23_02_20-5917188751755240698.
 2018-06-08 (08:11:38) Cleaning up.
 2018-06-08 (08:11:38) Starting worker pool teardown.
 2018-06-08 (08:11:38) Stopping worker pool...
 2018-06-08 (08:12:30) Autoscaling: Reduced the number of workers to 0 based on the rate of progress in the currently runni...

Stack Traces:

No errors have been received in this time period.

UPDATE: The workaround solution can be found in my answer below.

2 Answers:

Answer 0 (score: 4):

I managed to find a better, or at least simpler, solution. In the setup function of a DoFn, use the Cloud SQL proxy to set up the connection beforehand:

import os
import apache_beam as beam

class MyDoFn(beam.DoFn):
    def setup(self):
        # Download and start the Cloud SQL proxy on the worker; sql_args is assumed to be set on the DoFn (e.g. via its constructor)
        os.system("wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy")
        os.system("chmod +x cloud_sql_proxy")
        os.system(f"./cloud_sql_proxy -instances={self.sql_args['cloud_sql_connection_name']}=tcp:3306 &")
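
A minimal sketch of how a downstream process() could then use that proxy (not part of the original answer; credentials are placeholders, the port matches the tcp:3306 mapping above, and a small retry loop is added because the proxy may need a moment after setup() returns):

import time
import psycopg2
import apache_beam as beam

class QueryViaProxy(beam.DoFn):
    """Sketch only: assumes a Cloud SQL proxy (started as in setup() above)
    is listening on 127.0.0.1:3306; credentials are placeholders."""

    def process(self, element):
        con = None
        for _ in range(10):
            try:
                con = psycopg2.connect(
                    host='127.0.0.1', port=3306,
                    dbname='<DB_NAME>', user='<DB_USER>', password='<DB_PASSWORD>')
                break
            except psycopg2.OperationalError:
                time.sleep(1)  # proxy not ready yet, retry
        if con is None:
            raise RuntimeError('Cloud SQL proxy did not become ready')
        cur = con.cursor()
        cur.execute('SELECT 1')
        row = cur.fetchone()
        cur.close()
        con.close()
        yield row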

Answer 1 (score: 3):

Workaround solution:

I finally found a workaround. I decided to connect via the public IP of my CloudSQL instance. For that you need to allow connections to your CloudSQL instance from every IP:

  1. Go to the overview page of your CloudSQL instance in GCP
  2. Click on the Authorization tab
  3. Click on Add network and add 0.0.0.0/0 (!! this allows every IP address to connect to your instance !!)
  4. To add some security to the process, I used SSL keys and only allow SSL connections to the instance:

    1. Click on the SSL tab
    2. Click on Create a new certificate to create an SSL certificate for your server
    3. Click on Create a client certificate to create an SSL certificate for your client
    4. Click on Allow only SSL connections to reject all non-SSL connection attempts
    5. After that I stored the certificates in a Google Cloud Storage bucket and load them before connecting within the Dataflow job, i.e.:

      import psycopg2
      import psycopg2.extensions
      import select
      import os
      import stat
      from google.cloud import storage
      
      # Function to wait for an open connection when processing in parallel
      def wait(conn):
          while 1:
              state = conn.poll()
              if state == psycopg2.extensions.POLL_OK:
                  break
              elif state == psycopg2.extensions.POLL_WRITE:
                  select.select([], [conn.fileno()], [])
              elif state == psycopg2.extensions.POLL_READ:
                  select.select([conn.fileno()], [], [])
              else:
                  raise psycopg2.OperationalError("poll() returned %s" % state)
      
      # Function which returns a connection which can be used for queries
      def connect_to_db(host, hostaddr, dbname, user, password, sslmode = 'verify-full'):
      
          # Get keys from GCS
          client = storage.Client()
      
          bucket = client.get_bucket('<YOUR_BUCKET_NAME>')
      
          bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename('server-ca.pem')
          bucket.get_blob('PATH_TO/client-key.pem').download_to_filename('client-key.pem')
          os.chmod("client-key.pem", stat.S_IRWXU)
          bucket.get_blob('PATH_TO/client-cert.pem').download_to_filename('client-cert.pem')
      
          sslrootcert = 'server-ca.pem'
          sslkey = 'client-key.pem'
          sslcert = 'client-cert.pem'
      
          con = psycopg2.connect(
              host = host,
              hostaddr = hostaddr,
              dbname = dbname,
              user = user,
              password = password,
              sslmode=sslmode,
              sslrootcert = sslrootcert,
              sslcert = sslcert,
              sslkey = sslkey)
          return con
      

      I then use these functions in a custom ParDo to perform the queries. A minimal example:

      import apache_beam as beam
      from psycopg2.extras import RealDictCursor
      
      class ReadSQLTableNames(beam.DoFn):
          '''
          parDo class to get all table names of a given cloudSQL database.
          It will return each table name.
          '''
          def __init__(self, host, hostaddr, dbname, username, password):
              super(ReadSQLTableNames, self).__init__()
              self.host = host
              self.hostaddr = hostaddr
              self.dbname = dbname
              self.username = username
              self.password = password
      
          def process(self, element):
      
              # Connect to the database
              con = connect_to_db(host = self.host,
                  hostaddr = self.hostaddr,
                  dbname = self.dbname,
                  user = self.username,
                  password = self.password)
              # Wait for free connection
              wait(con)
              # Create cursor to query data
              cur = con.cursor(cursor_factory=RealDictCursor)
      
              # Get all table names
              cur.execute(
              """
              SELECT
              tablename as table
              FROM pg_tables
              WHERE schemaname = 'public'
              """
              )
              table_names = cur.fetchall()
      
              cur.close()
              con.close()
              for table_name in table_names:
                  yield table_name["table"]
      

      Part of the pipeline could then look like this:

      # Current workaround to query all tables: 
      # Create a dummy initiator PCollection with one element
      init = p        |'Begin pipeline with initiator' >> beam.Create(['All tables initializer'])
      
      tables = init   |'Get table names' >> beam.ParDo(ReadSQLTableNames(
                                                      host = known_args.host,
                                                      hostaddr = known_args.hostaddr,
                                                      dbname = known_args.db_name,
                                                      username = known_args.user,
                                                      password = known_args.password))
      
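      The job's actual goal is to write the queried rows to BigQuery. A rough sketch of how that last step could be attached (not part of my original code; the table reference and schema below are placeholders):

      # Sketch only: writes a PCollection of dicts to BigQuery. The table
      # reference and schema are placeholders, not values from the actual job.
      rows = tables | 'Get rows (placeholder)' >> beam.Map(lambda name: {'table': name})
      
      rows | 'Write to BQ' >> beam.io.WriteToBigQuery(
          table='<PROJECT>:<DATASET>.<TABLE>',
          schema='table:STRING',
          create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)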

      I hope this solution helps others with a similar problem.