Question

I've written a Python Dataflow job to process some data:

# Timings in the comments are the per-step times reported by Dataflow.
(
    pipeline
    | "read" >> beam.io.ReadFromText(known_args.input)  # 9 min 44 sec
    | "parse_line" >> beam.Map(parse_line)  # 4 min 55 sec
    | "add_key" >> beam.Map(add_key)  # 48 sec
    | "group_by_key" >> beam.GroupByKey()  # 11 min 56 sec
    | "map_values" >> beam.ParDo(MapValuesFn())  # 11 min 40 sec
    | "json_encode" >> beam.Map(json.dumps)  # 26 sec
    | "output" >> beam.io.textio.WriteToText(known_args.output)  # 22 sec
)

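(I've removed business-specific naming.)

The input is a 1.36 GiB gzip-compressed CSV, yet the job takes 37 min 34 sec to run. (I'm using Dataflow because I expect the input size to grow rapidly.)

How can I identify the bottleneck in the pipeline and speed up its execution? None of the individual functions is computationally expensive.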

Autoscaling information from the Dataflow console:

12:00:35 PM     Starting a pool of 1 workers. 
12:05:02 PM     Autoscaling: Raised the number of workers to 2 based on the rate of progress in the currently running step(s).
12:10:02 PM     Autoscaling: Reduced the number of workers to 1 based on the rate of progress in the currently running step(s).
12:29:09 PM     Autoscaling: Raised the number of workers to 3 based on the rate of progress in the currently running step(s).
12:35:10 PM     Stopping worker pool.

Answer 1

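I searched dev@beam.apache.org and found a thread that discusses this topic: https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E

You can check that thread for useful information, and ask questions or raise requests there if needed.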

Answer 2

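By accident, I found that the problem in this case was the compression of the CSV.

The input was a single gzip-compressed CSV. To make the data easier to inspect, I switched to an uncompressed CSV. That cut the processing time to under 17 minutes, and Dataflow's autoscaling peaked at 10 workers.

(If I still needed compression, I could split the CSV into several parts and compress each part separately.)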
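
One likely explanation, offered here as an assumption rather than something measured on this job: a gzip stream cannot be split, so a single .gz file is read sequentially by one worker, while an uncompressed file or many smaller .gz files can be read in parallel. Below is a minimal sketch of the splitting step using only the standard library. The function name, the shard naming scheme, and the lines_per_shard default are hypothetical, and the sketch assumes one record per line (no quoted fields containing newlines) and no header row.

```python
import gzip
import itertools
from pathlib import Path


def split_gzip_csv(src, out_dir, lines_per_shard=1_000_000):
    """Split one gzip-compressed CSV into several smaller gzip files.

    Assumes one record per line and no header row.
    Returns the list of shard paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    # newline="" keeps the original line endings untouched.
    with gzip.open(src, "rt", encoding="utf-8", newline="") as reader:
        for index in itertools.count():
            # Pull the next batch of lines; an empty batch means end of file.
            chunk = list(itertools.islice(reader, lines_per_shard))
            if not chunk:
                break
            shard = out_dir / f"part-{index:05d}.csv.gz"
            with gzip.open(shard, "wt", encoding="utf-8", newline="") as writer:
                writer.writelines(chunk)
            shards.append(shard)
    return shards
```

The shards could then be uploaded and read with a glob, for example ReadFromText("gs://my-bucket/shards/part-*.csv.gz") (the bucket path is hypothetical). By default ReadFromText detects gzip compression from the .gz extension.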

Answer 3

I found this Python Profiler package provided by Google: https://cloud.google.com/profiler/docs/profiling-python
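
As a complementary local check (not the Cloud Profiler package linked above), the per-element functions can be run over a sample of input lines under the standard library's cProfile to see which one dominates. This is only a sketch: parse_line and add_key below are hypothetical stand-ins for the real functions in the question, and profile_transforms is an illustrative helper name.

```python
import cProfile
import io
import pstats


def parse_line(line):
    # Hypothetical stand-in for the real parse_line in the pipeline.
    return line.rstrip("\n").split(",")


def add_key(fields):
    # Hypothetical stand-in: key each record by its first column.
    return fields[0], fields


def profile_transforms(lines, top=10):
    """Run the per-element functions under cProfile and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for line in lines:
        add_key(parse_line(line))
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [f"{i},a,b\n" for i in range(100_000)]
    print(profile_transforms(sample))
```

This only measures the Python functions themselves. It does not capture time spent on reading input, shuffling in GroupByKey, or worker startup, which is where the job in the question appears to lose most of its time.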

How do I profile a Python Dataflow job?
