Question

I have the following data in Google Cloud Storage:

Advertiser | Event
__________________
100 | Click
101 | Impression
100 | Impression
100 | Impression
101 | Impression

The output of my pipeline should be:

Advertiser | Count
__________________
100 | 3
101 | 2

First I used groupByKey, and the output looks like this:

100 Click, Impression, Impression

101 Impression, Impression

How do I proceed from here?

Answer 1

This counting pattern is described in the 'wordcount' sample of Apache Beam.

You can find the sample on GitHub among the Apache Beam examples: wordcount.py. The counting starts at line 95.
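
As a rough illustration of that kind of counting (pair each element with a 1, group by key, then sum the ones), here is a plain-Python sketch. It does not use Beam itself, and the helper names pair_with_one, group_by_key, and count_ones are made up for this example. In a real pipeline, the last step would be a Map applied to the output of GroupByKey, which is how you could continue from the grouped output in the question.

```python
from collections import defaultdict

# The (advertiser, event) records from the question.
events = [(100, "Click"), (101, "Impression"), (100, "Impression"),
          (100, "Impression"), (101, "Impression")]


def pair_with_one(records):
    """Replace each event with a 1 so that counting becomes summing."""
    return [(advertiser, 1) for advertiser, _event in records]


def group_by_key(pairs):
    """Collect all values that share a key, like GroupByKey does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def count_ones(grouped):
    """Sum the ones for each key to get the per-key count."""
    return {key: sum(ones) for key, ones in grouped.items()}


counts = count_ones(group_by_key(pair_with_one(events)))
print(counts)  # {100: 3, 101: 2}
```

If the grouped values are the original events rather than ones, counting the values of each group (for example with len) gives the same result.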

Answer 2

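Instead of GroupByKey, you may want to use a combine function, which is optimized to combine both before and after the grouping by key. Your pipeline would look like this: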

Python

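import apache_beam as beam

# (advertiser, event) pairs; only the key matters for the count.
collection_contents = [(100, 'Click'),
                       (101, 'Impression'),
                       (100, 'Impression'),
                       (100, 'Impression'),
                       (101, 'Impression')]

# `pipeline` is assumed to be an existing beam.Pipeline instance.
input_collection = pipeline | beam.Create(collection_contents)

# Count the elements for each key, yielding (100, 3) and (101, 2).
counts = input_collection | beam.combiners.Count.PerKey()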

This should output a collection with the shape you are looking for. The Count family of transforms is available in the apache_beam.transforms.combiners module.
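
To show what "combining before and after the grouping" means, here is a minimal plain-Python simulation. It is not Beam code, and the split of the input into two bundles is a hypothetical one chosen for illustration. Each bundle is first reduced to one partial count per key, and only those small partial results are merged afterwards.

```python
from collections import Counter

# Hypothetical split of the input across two workers (bundles).
bundle_a = [(100, "Click"), (101, "Impression"), (100, "Impression")]
bundle_b = [(100, "Impression"), (101, "Impression")]


def partial_counts(bundle):
    """Pre-combine within one bundle: one partial count per key."""
    return Counter(advertiser for advertiser, _event in bundle)


def merge_counts(partials):
    """Merge the partial counts per key after the grouping step."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return dict(total)


partials = [partial_counts(bundle_a), partial_counts(bundle_b)]
counts = merge_counts(partials)
print(counts)  # {100: 3, 101: 2}
```

With a plain GroupByKey, every individual event would have to be moved to the worker that owns its key. With pre-combining, only one number per key and bundle needs to be moved.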

Java

The same transform exists for Java in the org.apache.beam.sdk.transforms package:

// Count.perKey() produces Long counts, so the value type is Long.
PCollection<KV<Integer, Long>> resultColl = inputColl.apply(Count.perKey());

Count with group by in Google Dataflow
