Question

I'm currently using a streaming pipeline to process events and push them into a BigQuery table named EventsTable:

TransactionID    EventType
1                typeA
1                typeB
1                typeB
1                typeC
2                typeA
2                typeC
3                typeA

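I'd like to add a branch to my processing pipeline that also "groups" transaction-related data into a TransactionsTable. Roughly speaking, the value in each type column of TransactionsTable would be the count of the corresponding eventType for a given transaction. With the sample events above, the output would look like this: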

TransactionID      typeA     typeB     typeC
1                  1         2         1
2                  1         0         1
3                  1         0         0

The number of "type" columns would equal the number of distinct eventTypes that exist in the system.

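I'm trying to see how I could do this with Dataflow, but I can't find a clean way to do it. I know that PCollections are immutable, so I can't store the incoming data in an ever-growing PCollection that queues events until the other elements they need have arrived and the group can be written to the second BigQuery table. Is there some kind of windowing function that would allow this in Dataflow (for example, queuing events in a temporary windowed structure with some sort of expiry date)?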

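I could probably do something with batch jobs and PubSub, but that would be far more complex. On the other hand, I do understand that Dataflow isn't meant to hold ever-growing data structures, and that data, once it enters, has to go through the pipeline and exit (or be discarded). Am I missing something?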

Answer 1

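In general, the easiest way to do this kind of "aggregate data across many events" is to use a CombineFn, which lets you combine all of the values associated with a specific key. This is typically more efficient than queuing the events, because it only needs to accumulate the result rather than all of the events.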

For your specific case, you could create a custom CombineFn. The accumulator would be a Map<EventType, Long>. For example:

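// CombineFn is the nested class Combine.CombineFn from the Dataflow SDK.
import java.util.HashMap;
import java.util.Map;

public class TypedCountCombineFn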
    extends CombineFn<EventType, Map<EventType, Long>, TransactionRow> {
  @Override
  public Map<EventType, Long> createAccumulator() {
    return new HashMap<>();
  }
  @Override
  public Map<EventType, Long> addInput(
      Map<EventType, Long> accum, EventType input) {
    // Store the incremented count back in the map; a missing type starts at 1.
    Long count = accum.get(input);
    accum.put(input, count == null ? 1L : count + 1);
    return accum;
  }
  @Override
  public Map<EventType, Long> mergeAccumulators(
      Iterable<Map<EventType, Long>> accums) {
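    // Sum up the counts for matching event types across all accumulators.
    Map<EventType, Long> merged = createAccumulator();
    for (Map<EventType, Long> accum : accums) {
      for (Map.Entry<EventType, Long> e : accum.entrySet()) {
        Long count = merged.get(e.getKey());
        merged.put(e.getKey(), count == null ? e.getValue() : count + e.getValue());
      }
    }
    return merged;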
  }
  @Override
  public TransactionRow extractOutput(Map<EventType, Long> accum) {
    // TODO: Build an output row from the per-event-type accumulator
  }
}
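
To make the accumulator flow concrete, here is a minimal plain-Java sketch of the same four steps that runs without the Dataflow SDK. It is only an illustration under assumptions: event types are plain Strings, the output row is a List<Long>, and the fixed COLUMNS list is hypothetical (the question doesn't say where the set of event types comes from). The class name TypedCountSketch is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the four CombineFn steps; no Dataflow SDK involved.
public class TypedCountSketch {
  // Hypothetical fixed column order; a real pipeline needs its own source for this.
  public static final List<String> COLUMNS = Arrays.asList("typeA", "typeB", "typeC");

  public static Map<String, Long> createAccumulator() {
    return new HashMap<>();
  }

  // Increment the count for one event type and store it back in the map.
  public static Map<String, Long> addInput(Map<String, Long> accum, String eventType) {
    Long count = accum.get(eventType);
    accum.put(eventType, count == null ? 1L : count + 1);
    return accum;
  }

  // Sum the counts of matching event types across partial accumulators.
  public static Map<String, Long> mergeAccumulators(List<Map<String, Long>> accums) {
    Map<String, Long> merged = createAccumulator();
    for (Map<String, Long> accum : accums) {
      for (Map.Entry<String, Long> e : accum.entrySet()) {
        Long count = merged.get(e.getKey());
        merged.put(e.getKey(), count == null ? e.getValue() : count + e.getValue());
      }
    }
    return merged;
  }

  // One count per column, with 0 for event types that were never seen.
  public static List<Long> extractOutput(Map<String, Long> accum) {
    List<Long> row = new ArrayList<>();
    for (String column : COLUMNS) {
      Long count = accum.get(column);
      row.add(count == null ? 0L : count);
    }
    return row;
  }

  public static void main(String[] args) {
    // The four events of transaction 1, split across two partial accumulators.
    Map<String, Long> first = addInput(addInput(createAccumulator(), "typeA"), "typeB");
    Map<String, Long> second = addInput(addInput(createAccumulator(), "typeB"), "typeC");
    // prints [1, 2, 1]
    System.out.println(extractOutput(mergeAccumulators(Arrays.asList(first, second))));
  }
}
```

Only the small count maps are carried between steps, never the events themselves, which is why this tends to be cheaper than queuing events.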

This CombineFn can be applied globally (across all transactions in a PCollection) or per key (for example, per transaction ID):

PCollection<EventType> pc = ...;

// Globally
PCollection<TransactionRow> globalCounts = pc.apply(Combine.globally(new TypedCountCombineFn()));

// PerKey
PCollection<KV<Long, EventType>> keyedPC = pc.apply(WithKeys.of(new SerializableFunction<EventType, Long>() {
  @Override
  public Long apply(EventType in) {
    // The return type must be the boxed Long to match SerializableFunction<EventType, Long>.
    return in.getTransactionId();
  }
}));
PCollection<KV<Long, TransactionRow>> keyedCounts =
  keyedPC.apply(Combine.perKey(new TypedCountCombineFn()));
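
To see what the two scopes produce on the sample events from the question, here is a rough sketch that uses plain Java streams as a stand-in for the pipeline; it does not use the Dataflow SDK. The Event class, the sampleEvents helper, and the class name CombineScopeSketch are hypothetical. In this sketch, event types that never occur for a transaction are simply absent from its map rather than reported as 0.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CombineScopeSketch {
  // Hypothetical event shape: a transaction ID plus an event type name.
  public static final class Event {
    final long transactionId;
    final String eventType;

    public Event(long transactionId, String eventType) {
      this.transactionId = transactionId;
      this.eventType = eventType;
    }
  }

  // The sample rows of EventsTable from the question.
  public static List<Event> sampleEvents() {
    return Arrays.asList(
        new Event(1, "typeA"), new Event(1, "typeB"), new Event(1, "typeB"),
        new Event(1, "typeC"), new Event(2, "typeA"), new Event(2, "typeC"),
        new Event(3, "typeA"));
  }

  // Analogue of the global combine: one count map for the whole collection.
  public static Map<String, Long> countGlobally(List<Event> events) {
    return events.stream().collect(
        Collectors.groupingBy(e -> e.eventType, TreeMap::new, Collectors.counting()));
  }

  // Analogue of the per-key combine: one count map per transaction ID.
  public static Map<Long, Map<String, Long>> countPerKey(List<Event> events) {
    return events.stream().collect(
        Collectors.groupingBy(e -> e.transactionId, TreeMap::new,
            Collectors.groupingBy(e -> e.eventType, TreeMap::new, Collectors.counting())));
  }

  public static void main(String[] args) {
    // prints {typeA=3, typeB=2, typeC=2}
    System.out.println(countGlobally(sampleEvents()));
    // prints {1={typeA=1, typeB=2, typeC=1}, 2={typeA=1, typeC=1}, 3={typeA=1}}
    System.out.println(countPerKey(sampleEvents()));
  }
}
```

The per-key result is the one that matches the TransactionsTable layout in the question, with one row per transaction ID.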

Data reshaping operations in Google Dataflow