How do I run a Dataflow job from Datalab in Python?

Time: 2019-01-11 10:27:39

Tags: python google-cloud-platform google-cloud-dataflow apache-beam google-cloud-datalab

I am having trouble running a Dataflow job from Datalab. What I could really use is a minimal working Python example for this situation, since neither the Google Cloud Platform nor the Apache Beam documentation seems to provide one.

It would really help me to see some Python code that I can run from a Datalab cell which does the following:

# 1. Sets up the job

# 2. Defines the processing logic to be applied to the input data files

# 3. Saves the processed files to an output folder

# 4. Submits the job to Google Cloud Dataflow

To work through this problem, I tried adapting the wordcount example from the Google and Apache documentation so that it runs inside Datalab. The code I used is below, but it is not clear to me which parts I can strip out to turn it into a truly minimal working example.

from __future__ import absolute_import
import argparse
import logging
import re
from past.builtins import unicode
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

def run(argv=None):
  """Main entry point; defines and runs the wordcount pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--input',
                      dest='input',
                      default='gs://data-analytics/kinglear.txt',
                      help='Input file to process.')
  parser.add_argument('--output',
                      dest='output',
                      default='gs://data-analytics/output',
                      help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)
  pipeline_args.extend([
      '--runner=DataflowRunner',
      '--project=project',
      '--staging_location=gs://staging',
      '--temp_location=gs://tmp',
      '--job_name=your-wordcount-job',
  ])

  # We use the save_main_session option because one or more DoFn's in this
  # workflow rely on global context (e.g., a module imported at module level).
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
                  .with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word_count):
      (word, count) = word_count
      return '%s: %s' % (word, count)

    output = counts | 'Format' >> beam.Map(format_result)

    # Write the output using a "Write" transform that has side effects.
    output | WriteToText(known_args.output)

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

Thanks!

Josh

2 Answers:

Answer 0 (score: 1)

I think you are confusing the relationship between Datalab and Dataflow. These are two different programming platforms, and you are mixing them together. Take your comment Defines the processing logic to be applied to the input data files: the processing logic is supplied by the source code (or template) you hand to Cloud Dataflow, not by code running in a Cloud Datalab notebook.

As an option: if you install the Cloud Dataflow library and use Python 2.x, you can write Cloud Dataflow (Apache Beam) software inside a Datalab notebook. That code will run locally inside Datalab and will not launch a Dataflow job.
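For example, a minimal sketch along these lines (assuming the apache-beam package is installed in the notebook's kernel) executes the whole pipeline locally with the DirectRunner and never contacts the Dataflow service:

from __future__ import print_function
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Everything runs inside the notebook process via the DirectRunner;
# nothing is submitted to the Dataflow service.
with beam.Pipeline(options=PipelineOptions(['--runner=DirectRunner'])) as p:
  (p
   | 'Create' >> beam.Create(['to be or not to be'])
   | 'Split' >> beam.FlatMap(lambda line: line.split())
   | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
   | 'GroupAndSum' >> beam.CombinePerKey(sum)
   | 'Print' >> beam.Map(print))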

Here are some links to help you write software that creates Cloud Dataflow jobs.

This is a StackOverflow answer that shows how to start a Dataflow job in Python:

https://stackoverflow.com/a/52405696/8016720

Google's Dataflow documentation for Java, but with a good explanation of the required steps:

Method: projects.jobs.list

Here is a link to the Dataflow Python Client API:

Dataflow Client API
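As an illustration only (the project ID is a placeholder, and authentication relies on the application default credentials available inside Datalab), listing your existing Dataflow jobs through that client API could look roughly like this:

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

# Calls the Dataflow REST API's projects.jobs.list method and prints
# each job's name and current state.
credentials = GoogleCredentials.get_application_default()
dataflow = build('dataflow', 'v1b3', credentials=credentials)
response = dataflow.projects().jobs().list(projectId='your-project-id').execute()
for job in response.get('jobs', []):
  print('%s\t%s' % (job['name'], job['currentState']))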

Answer 1 (score: 1)

I solved this with the help of the following tutorial: https://github.com/hayatoy/dataflow-tutorial, and can now launch a Dataflow job from Datalab with the code below.

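A minimal sketch of that notebook cell, following the tutorial's pattern of setting GoogleCloudOptions directly and calling p.run() (the project ID and bucket names are placeholders to replace with your own):

import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import SetupOptions

# Placeholder project and bucket names -- replace with your own.
options = PipelineOptions()
gcloud_options = options.view_as(GoogleCloudOptions)
gcloud_options.project = 'your-project-id'
gcloud_options.job_name = 'datalab-wordcount'
gcloud_options.staging_location = 'gs://your-bucket/staging'
gcloud_options.temp_location = 'gs://your-bucket/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(SetupOptions).save_main_session = True

p = beam.Pipeline(options=options)
(p
 | 'Read' >> ReadFromText('gs://data-analytics/kinglear.txt')
 | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
 | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
 | 'GroupAndSum' >> beam.CombinePerKey(sum)
 | 'Format' >> beam.Map(lambda word_count: '%s: %d' % word_count)
 | 'Write' >> WriteToText('gs://data-analytics/output'))

# Submits the job to the Dataflow service and returns immediately;
# the notebook does not block while the job runs.
p.run()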

Thanks

Josh