Dataflow in Python shows no output collection from a Pub/Sub subscription

Time: 2019-07-18 21:03:15

Tags: python google-cloud-dataflow dataflow

I have a Dataflow job written in Python. It is very simple: it just reads from a subscription, applies a fixed window, and writes to GCS.

The problem is that after reading from the subscription, the FixedWindows transform shows no output collection.

I have tried everything with no luck.

Here is my code:

import apache_beam as beam
import argparse
import logging
import apache_beam.transforms.window as window
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions

def run(argv=None):

    """Main entry point; defines and runs the wordcount pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        required=True,
                        default='gs://dataflow-samples/shakespeare/kinglear.txt',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        required=True,
                        default='gs://dataflow-samples/',
                        help='Output file to write results to.')
    parser.add_argument('--topic',
                        dest='topic',
                        required=True,
                        help='Topic for message.')
    parser.add_argument('--subscription',
                        dest='subscription',
                        required=True,
                        help='Subscription for message.')
    parser.add_argument('--entity_type',
                        dest='entity_type',
                        required=True,
                        help='Entity Type for message.')
    parser.add_argument('--event_type',
                        dest='event_type',
                        required=True,
                        help='Event Type for message.')                        
    parser.add_argument('--outputFilenamePrefix',
                        dest='outputFilenamePrefix',
                        required=True,
                        help='Output Filename Prefix Type for message.')   
    parser.add_argument('--outputFilenameSuffix',
                        dest='outputFilenameSuffix',
                        required=True,
                        help='Output Filename Suffix Type for message.') 

    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline_options.view_as(StandardOptions).streaming = True
    p = beam.Pipeline(options=pipeline_options)

    if known_args.subscription:
        messages = (p 
                | ReadFromPubSub(subscription=known_args.subscription, with_attributes=True))
    else:
        messages = (p 
                | ReadFromPubSub(topic=known_args.topic, with_attributes=True))

    (messages 
            | beam.WindowInto(window.FixedWindows(120))
            | beam.io.WriteToText(known_args.output + known_args.outputFilenamePrefix, 
                                    file_name_suffix=known_args.outputFilenameSuffix,
                                    num_shards=1))

    result = p.run()
    result.wait_until_finish()

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()

The idea is to save the results to the provided bucket. I have been reading that some features, such as windowing, are not yet supported for unbounded data. In that case, it seems the only solution would be to use Java.

1 Answer:

Answer 0 (score: 1)

Python does not support WriteToText when writing to GCS from a streaming pipeline. As you discuss, this would work in Java. Alternatively, you could write the records to a different IO, such as BigQuery.

Supported built-in I/O transforms
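
As a rough sketch of the BigQuery alternative (not the answerer's code): it assumes a hypothetical table my_dataset.events with entity_type and event_type columns, and that each Pub/Sub message payload is a UTF-8 JSON object containing those fields. Adjust the table name, schema, parsing, and pipeline options to your own setup.

import json

import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions

def run_bigquery_sketch(subscription, table='my_project:my_dataset.events'):
    # Hypothetical table and schema; replace with your own.
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> ReadFromPubSub(subscription=subscription, with_attributes=True)
         | 'Window' >> beam.WindowInto(window.FixedWindows(120))
         # Decode each PubsubMessage payload into a dict matching the table schema.
         | 'Parse' >> beam.Map(lambda msg: json.loads(msg.data.decode('utf-8')))
         | 'Write' >> beam.io.WriteToBigQuery(
             table,
             schema='entity_type:STRING,event_type:STRING',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Unlike WriteToText, WriteToBigQuery can consume an unbounded PCollection in a streaming pipeline, which is why it avoids the problem described in the question.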