Unable to create a templated Dataflow pipeline in Python

Asked: 2018-01-17 19:35:18

Tags: python google-cloud-dataflow

I am trying to convert the Cloud Dataflow "Wordcount" example into a templated one by modifying the pipeline options to use runtime parameters, as instructed in the docs:

def run(argv=None):
  """Main entry point; defines and runs the wordcount pipeline."""

  class WordcountTemplatedOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
      # Use add_value_provider_argument for arguments to be templatable
      # Use add_argument as usual for non-templatable arguments
      parser.add_value_provider_argument(
          '--input',
          default='gs://dataflow-samples/shakespeare/kinglear.txt',
          help='Path of the file to read from')
      parser.add_argument(
          '--output',
          required=True,
          help='Output file to write results to.')
  pipeline_options = PipelineOptions(['--output', 'some/output_path'])
  p = beam.Pipeline(options=pipeline_options)
  wordcount_options = pipeline_options.view_as(WordcountTemplatedOptions)

  # Read the text file[pattern] into a PCollection.
  etc. etc.

The problem is creating and staging the template. When I execute the command, the output is:

INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:root:Finished the size estimation of the input at 1 files. Estimation took 0.288088083267 seconds
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
INFO:root:Starting finalize_write threads with num_shards: 1, batches: 1, num_threads: 1
INFO:root:Renamed 1 shards in 0.13 seconds.
INFO:root:number of empty lines: 1663
INFO:root:average word length: 4

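No file is generated under template_location (gs://[YOUR_BUCKET_NAME]/templates/mytemplate).

I suspected the command was running the pipeline from my desktop against the "default" input file instead of creating a template, so I removed the "default" line from the --input argument. That produced this error: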

raise BeamIOError('Unable to get the Filesystem', {path: e})
apache_beam.io.filesystem.BeamIOError: Unable to get the Filesystem with exceptions {None: AttributeError("'NoneType' object has no attribute 'strip'",)}

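There is no official Python Dataflow template sample (the only snippet I could find is this one, which looks very much like the code above).

Am I missing something?

Thanks!

1 Answer:

Answer 0: (score: 2)

Thanks to Google Cloud Support, I was able to solve the problem. To summarize:

  1. Clone the latest wordcount.py example (I was using an old version):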

    git clone https://github.com/apache/beam.git

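  2. The Google team updated the tutorial, so just follow the code instructions. Make sure to include the @classmethod _add_argparse_args so that parameters can be received at runtime, and use the new options when reading from the text file:

    wordcount_options = pipeline_options.view_as(WordcountTemplatedOptions)
    lines = p | 'read' >> ReadFromText(wordcount_options.input)

  3. Generate the template as instructed.

  4. You should now see the template under the template_location directory.

    Thanks!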