Different behavior when calling a function versus inlining the code

Time: 2019-06-12 00:53:22

Tags: python apache-beam

I am trying to see whether the elements of a PCollection can be sent to the parent process when using the DirectRunner from Apache Beam's Python SDK.

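However, I ran into a strange bug: when the queue is instantiated and the pipeline is invoked in the __main__ section of the script, everything appears to work, but when the same code is called from inside a function, it does not. I suspect this is caused by some pickling/dill happening behind the scenes, but a more specific explanation would be appreciated.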

The /tmp/inputs/winterstale.txt file used below can be downloaded from: https://storage.googleapis.com/apache-beam-samples/shakespeare/winterstale.txt

from __future__ import print_function

import atexit
import queue
import tempfile
import time
import unittest

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.runners.direct.direct_runner import BundleBasedDirectRunner
from apache_beam.runners.interactive.cache_manager import FileBasedCacheManager
from apache_beam.runners.interactive.cache_manager import ReadCache
from apache_beam.runners.interactive.cache_manager import WriteCache


def add_to_queue(element, queue):
  queue.put(element)


def write_to_queue():
  q = queue.Queue()

  with beam.Pipeline(runner=BundleBasedDirectRunner()) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromText("/tmp/inputs/winterstale.txt")
        | "Remove whitespace" >> beam.Map(lambda element: element.strip("\n\t|"))
        | "Remove empty lines" >> beam.FlatMap(lambda element: [element] if element else [])
        | "Write" >> beam.Map(lambda element: add_to_queue(element, queue=q))
    )

  return list(q.queue)


if __name__ == "__main__":
  cache_location = tempfile.mkdtemp()
  atexit.register(FileSystems.delete, [cache_location])

  # Using a function call
  cache_manager = FileBasedCacheManager(cache_dir=cache_location)

  result1 = write_to_queue()
  print(len(result1))  # >>> prints "0" <<<

  # Copy-pasting the code from "write_to_queue()"
  q = queue.Queue()

  with beam.Pipeline(runner=BundleBasedDirectRunner()) as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromText("/tmp/inputs/winterstale.txt")
        | "Remove whitespace" >> beam.Map(lambda element: element.strip("\n\t|"))
        | "Remove empty lines" >> beam.FlatMap(lambda element: [element] if element else [])
        | "Write" >> beam.Map(lambda element: add_to_queue(element, queue=q))
    )

  result2 = list(q.queue)
  print(len(result2))  # >>> prints "3561" <<<

1 answer:

Answer 0 (score: 1)

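Typically, everything gets pickled before being sent to the runner. In this case, the queue object itself would normally get pickled, and your elements would be appended to the unpickled copy during execution (hence the result of 0). I think what is happening here is that the BundleBasedDirectRunner is not deterministic about what it pickles (e.g. depending on an earlier pickling error, caused by including a closure from the main session, it may abandon all attempts to pickle and proceed with the original objects).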
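To illustrate the two outcomes described above, here is a minimal sketch that uses only the standard library. The ship() helper is hypothetical and is not Beam's code: it merely assumes a runner that tries to pickle an object and silently falls back to the original when pickling fails. Note also that Beam pickles with dill rather than the standard pickle module, so which objects actually fail to pickle may differ from what is shown here.

```python
import pickle
import queue


def ship(obj):
    """Hypothetical stand-in for a runner handing an object to a worker.

    Tries a pickle round trip; on any failure it silently falls back to the
    original object. An illustration only, not how Beam is implemented.
    """
    try:
        return pickle.loads(pickle.dumps(obj)), True
    except Exception:
        return obj, False


elements = ["a", "b", "c"]

# Case 1: a plain list pickles fine, so the "worker" appends to a copy
# and the parent's list never sees the elements.
sink = []
worker_sink, was_copied = ship(sink)
for element in elements:
    worker_sink.append(element)
print(was_copied, len(sink), len(worker_sink))  # True 0 3

# Case 2: queue.Queue holds thread locks, which the standard pickle module
# rejects, so the fallback hands the worker the very same object.
q = queue.Queue()
worker_q, q_copied = ship(q)
for element in elements:
    worker_q.put(element)
print(q_copied, q.qsize())  # False 3
```

Whenever the round trip succeeds, the parent's object stays empty; only the silent fallback makes the elements visible to the parent, which matches the difference between the 0 and the 3561 in the question.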

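It may be worth trying this with other runners, where the behavior should be consistent (probably always zero) and where a pickling error, if there is one, would produce a meaningful message instead of being suppressed.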