Question

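I have a general question about side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement have to be passed as side inputs? Or is it OK to pass them as regular constructor arguments of the DoFn? For example, suppose I have some fixed (not computed) values (constants, such as a start date and an end date) that I want to use during the per-element computation in processElement. I could turn each of those values into its own singleton PCollectionView and pass the views to the DoFn constructor as side inputs. But instead of doing that, can I not simply pass each constant to the DoFn as a regular constructor argument? Am I missing something here?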

In terms of code, when should I do:

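// Output type is assumed to equal the input type; the original snippet omits it.
public static class MyFilter
    extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {
  // these are singleton views
  private final PCollectionView<LocalDateTime> dateStartView;
  private final PCollectionView<LocalDateTime> dateEndView;

  public MyFilter(PCollectionView<LocalDateTime> dateStartView,
                  PCollectionView<LocalDateTime> dateEndView) {
    this.dateStartView = dateStartView;
    this.dateEndView = dateEndView;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // extract date values from the singleton views here and use them
    LocalDateTime dateStart = c.sideInput(dateStartView);
    LocalDateTime dateEnd = c.sideInput(dateEndView);
    // ... (rest of the method is elided in the original)
  }
}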

versus:

// Output type is assumed to equal the input type; the original snippet omits it.
public static class MyFilter
    extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {
  private final LocalDateTime dateStart;
  private final LocalDateTime dateEnd;

  public MyFilter(LocalDateTime dateStart,
                  LocalDateTime dateEnd) {
    this.dateStart = dateStart;
    this.dateEnd = dateEnd;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // use the passed in date values directly here
    // ... (rest of the method is elided in the original)
  }
}

Note that in these examples, dateStart and dateEnd are fixed values, not dynamic results of any previous computation in the pipeline.

Answer 1

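When you call something like pipeline.apply(ParDo.of(new MyFilter(...))), the DoFn is instantiated in the main program that you use to launch the pipeline. It is then serialized and passed to the runner for execution. The runner decides where to execute it, for example on a fleet of 100 VMs, each of which receives its own copy of the code and the serialized data. If the member variables are serializable and you do not mutate them at execution time, this should be fine: the DoFn is deserialized on each node with all of its fields populated, and it executes as expected. However, you do not control the number of instances or (to some extent) their lifecycle, so mutate them at your own risk.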
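
The serialize-and-ship behaviour described above can be illustrated without Beam at all. The following is only a rough sketch in plain Java, assuming standard Java serialization as a stand-in for whatever a runner actually does; SerializedFilterSketch, shipToWorker, accept, and seen are hypothetical names invented for this illustration and are not part of the Beam API. It shows that constructor-supplied constants survive the round trip, and that each deserialized copy has its own mutable state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.LocalDateTime;

class SerializedFilterSketch {

  // Hypothetical stand-in for a DoFn: serializable, constants set via the constructor.
  static class MyFilter implements Serializable {
    private static final long serialVersionUID = 1L;
    private final LocalDateTime dateStart;
    private final LocalDateTime dateEnd;
    // Mutable field, included only to show that copies do not share state.
    private int seen = 0;

    MyFilter(LocalDateTime dateStart, LocalDateTime dateEnd) {
      this.dateStart = dateStart;
      this.dateEnd = dateEnd;
    }

    // Plays the role of processElement: accepts timestamps in [dateStart, dateEnd).
    boolean accept(LocalDateTime t) {
      seen++;
      return !t.isBefore(dateStart) && t.isBefore(dateEnd);
    }

    int seen() {
      return seen;
    }
  }

  // Simulates shipping the function object to a worker: serialize, then deserialize a copy.
  static MyFilter shipToWorker(MyFilter fn) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (MyFilter) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    MyFilter original = new MyFilter(
        LocalDateTime.of(2020, 1, 1, 0, 0), LocalDateTime.of(2020, 2, 1, 0, 0));
    MyFilter workerA = shipToWorker(original);
    MyFilter workerB = shipToWorker(original);

    // The constants arrive intact on the "worker" copy.
    System.out.println(workerA.accept(LocalDateTime.of(2020, 1, 15, 12, 0))); // true
    System.out.println(workerA.accept(LocalDateTime.of(2020, 3, 1, 0, 0)));   // false
    // Mutations stay local to one copy: prints "2 0 0".
    System.out.println(workerA.seen() + " " + workerB.seen() + " " + original.seen());
  }
}
```

The counter is what makes mutation risky: each copy counts only the elements it happened to see, so any logic that relies on a field being updated "globally" would silently break once the work is spread over several instances.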

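The benefit of PCollections and side inputs is that you are not limited to static values. For a few simple immutable values, passing them as constructor arguments should be fine.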
