Question

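I have a PCollection that reads its data from AvroIO. I want to apply an aggregation so that, after grouping by a specific key, I can compute the distinct count of certain fields within each group.

In plain Pig or SQL this is just a GROUP BY with a distinct count, but I cannot work out how to do the same thing in Beam.

So far I have been able to write the following: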

Schema schema = new Schema.Parser().parse(new File(options.getInputSchema()));

Pipeline pipeline = Pipeline.create(options);
PCollection<GenericRecord> inputData =
    pipeline.apply(AvroIO.readGenericRecords(schema).from(options.getInput()));
PCollection<Row> filteredData =
    inputData.apply(Select.fieldNames("user_id", "field1", "field2"));
PCollection<Row> groupedData =
    filteredData.apply(
        Group.byFieldNames("user_id")
            .aggregateField("field1", Count.perElement(), "out_field1")
            .aggregateField("field2", Count.perElement(), "out_field2"));

However, the aggregateField method does not accept these arguments.

Can someone help me with the correct way to do this?

Thanks!
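
For reference, here is a minimal sketch of the result being asked for: a distinct count of field1 per user_id. It is written in plain Java with no Beam dependency, so it only illustrates the semantics and is not a pipeline. One likely reason for the rejected arguments is that aggregateField takes a CombineFn, while Count.perElement() is a PTransform. Count.combineFn() would fit that parameter, but it counts every value rather than distinct ones. The four methods below deliberately use the same names as Beam's Combine.CombineFn (createAccumulator, addInput, mergeAccumulators, extractOutput), so the logic could probably be moved into a class extending Combine.CombineFn<String, Set<String>, Long>. That port is an assumption and has not been tried here. The names DistinctCountSketch, Event and distinctField1PerUser are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DistinctCountSketch {

    // Hypothetical stand-in for one input record; field names follow the question.
    static final class Event {
        final String userId;
        final String field1;

        Event(String userId, String field1) {
            this.userId = userId;
            this.field1 = field1;
        }
    }

    // Starts an empty set of values seen so far.
    static Set<String> createAccumulator() {
        return new HashSet<>();
    }

    // Adds one value; duplicates are dropped by the set.
    static Set<String> addInput(Set<String> accumulator, String value) {
        accumulator.add(value);
        return accumulator;
    }

    // Unions partial sets, as would happen when workers combine partial results.
    static Set<String> mergeAccumulators(Iterable<Set<String>> accumulators) {
        Set<String> merged = createAccumulator();
        for (Set<String> accumulator : accumulators) {
            merged.addAll(accumulator);
        }
        return merged;
    }

    // The distinct count is the size of the set.
    static long extractOutput(Set<String> accumulator) {
        return accumulator.size();
    }

    // Groups events by userId and counts the distinct field1 values in each group.
    static Map<String, Long> distinctField1PerUser(List<Event> events) {
        Map<String, Set<String>> perUser = new TreeMap<>();
        for (Event event : events) {
            Set<String> accumulator =
                perUser.computeIfAbsent(event.userId, key -> createAccumulator());
            addInput(accumulator, event.field1);
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : perUser.entrySet()) {
            result.put(entry.getKey(), extractOutput(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("u1", "a"),
            new Event("u1", "a"),
            new Event("u1", "b"),
            new Event("u2", "c"));
        System.out.println(distinctField1PerUser(events)); // prints {u1=2, u2=1}
    }
}
```

Keeping every distinct value in a set is exact but uses memory proportional to the number of distinct values per key, so an approximate counter may be a better fit for very large groups.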

Apache Beam group by with aggregated fields

0 answers: