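I have a Dataflow job written in Python. It is very simple: it reads from a subscription, applies a fixed window, and then writes to GCS.
The problem is that, after reading from the subscription, the FixedWindows step does not show any output collection.
I have tried everything I can think of, with no luck.
Here is my code: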
import apache_beam as beam
import argparse
import logging
import apache_beam.transforms.window as window
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
def run(argv=None):
    """Main entry point; defines and runs the streaming pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        required=True,
                        default='gs://dataflow-samples/shakespeare/kinglear.txt',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        required=True,
                        default='gs://dataflow-samples/',
                        help='Output file to write results to.')
    parser.add_argument('--topic',
                        dest='topic',
                        required=True,
                        help='Topic for message.')
    parser.add_argument('--subscription',
                        dest='subscription',
                        required=True,
                        help='Subscription for message.')
    parser.add_argument('--entity_type',
                        dest='entity_type',
                        required=True,
                        help='Entity Type for message.')
    parser.add_argument('--event_type',
                        dest='event_type',
                        required=True,
                        help='Event Type for message.')
    parser.add_argument('--outputFilenamePrefix',
                        dest='outputFilenamePrefix',
                        required=True,
                        help='Output Filename Prefix Type for message.')
    parser.add_argument('--outputFilenameSuffix',
                        dest='outputFilenameSuffix',
                        required=True,
                        help='Output Filename Suffix Type for message.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline_options.view_as(StandardOptions).streaming = True

    p = beam.Pipeline(options=pipeline_options)

    # Read from the subscription when one is given, otherwise from the topic.
    if known_args.subscription:
        messages = (p
                    | ReadFromPubSub(subscription=known_args.subscription,
                                     with_attributes=True))
    else:
        # The topic must be passed as `topic=`, not `subscription=`.
        messages = (p
                    | ReadFromPubSub(topic=known_args.topic,
                                     with_attributes=True))

    # Assign each message to a 120-second fixed window, then write to GCS.
    (messages
     | beam.WindowInto(window.FixedWindows(120))
     | beam.io.WriteToText(known_args.output + known_args.outputFilenamePrefix,
                           file_name_suffix=known_args.outputFilenameSuffix,
                           num_shards=1))

    result = p.run()
    result.wait_until_finish()


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()
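The idea is to save the results in the provided bucket. I have read that some features, such as windowing, are not yet supported for unbounded data. In that case, it seems the only solution is to use Java.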
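For reference, here is a minimal sketch in plain Python (not Beam) of what a 120-second fixed window does conceptually: each element is assigned to the window `[start, start + 120)` that contains its timestamp. The helper names `window_start` and `group_by_window` and the sample events are hypothetical and only for illustration. In Beam itself, `WindowInto` only assigns elements to windows; per-window results come from a later aggregation or a sink that supports windowed, unbounded input.

```python
from collections import defaultdict

WINDOW_SIZE = 120  # seconds, matching FixedWindows(120) in the question


def window_start(timestamp, size=WINDOW_SIZE):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_window(events, size=WINDOW_SIZE):
    """Group (timestamp, payload) pairs by fixed window [start, start + size)."""
    windows = defaultdict(list)
    for timestamp, payload in events:
        start = window_start(timestamp, size)
        windows[(start, start + size)].append(payload)
    return dict(windows)


# Hypothetical events: (event time in seconds, message body).
events = [(5, "a"), (119, "b"), (120, "c"), (250, "d")]
grouped = group_by_window(events)

for (start, end), payloads in sorted(grouped.items()):
    print(f"window [{start}, {end}): {payloads}")
```

A timestamp exactly on a boundary (120 here) belongs to the window that starts at that boundary, since the end of each window is exclusive.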
Answer 0 (score: 1)
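Python streaming pipelines do not support WriteToText. As you mentioned, this will work in Java. Alternatively, you can write the records to a different IO (for example, BigQuery).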