Sum and average aggregation using Dataflow

Time: 2016-01-06 10:01:48

Tags: google-cloud-dataflow

I have sample data of the following form.

s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user

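Now I need to find average_time_span/user, average_time_span/user_level, and total_time_span/user.

I am able to find each of the values mentioned above individually, but I cannot find all of them at once. Since I am new to Dataflow, please suggest an appropriate approach.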

static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
        @Override
        public void processElement(ProcessContext c) {

            String[] words = c.element().split(",");

            if (words.length == 5) {
                Instant timestamp = Instant.parse(words[1].trim());                    
                KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
                KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));                    
                c.outputWithTimestamp(userTime, timestamp);
                c.outputWithTimestamp(userLevelTime, timestamp);

            }
        }
    }


public static void main(String[] args) {
    TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(TestOptions.class);
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
            .apply(ParDo.of(new ExtractUserAndUserLevelFn()))
            .apply(Window.<KV<String, Long>>into(
                    FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
            .apply(GroupByKey.<String, Long>create())
            .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
                @Override
                public void processElement(ProcessContext c) {
                    String key = c.element().getKey();
                    // All time_span values that share this key within the window.
                    Iterable<Long> timeSpans = c.element().getValue();
                    Long sum = 0L;
                    for (Long item : timeSpans)
                        sum += item;
                    KV<String, Long> userTime = KV.of(key, sum);
                    c.output(userTime);
                }
            }))
            .apply(MapElements.via(new FormatAsTextFn()))
            .apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
                    withNumShards(options.getShardsNumber()));

    p.run();
}

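2 Answers:

Answer 0 (score: 2)

One approach is to first parse the lines into a PCollection that contains one record per line, and then create two PCollections of key-value pairs from that collection. Suppose you define a record representing a line like this: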

static class Record implements Serializable {
  final String user;
  final String role;
  final long duration;

  Record(String user, String role, long duration) {
    this.user = user;
    this.role = role;
    this.duration = duration;
  }
}

Now, create a LineToRecordFn that builds a Record from an input line, so that you can do this:

PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
                               .from(options.getInputFile()))
                               .apply(ParDo.of(new LineToRecordFn()));
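
LineToRecordFn is not defined in the answer. As a minimal sketch, assuming the five-column comma-separated layout from the question, its parsing step could look like the plain-Java helper below. The names LineParsingSketch and parseLine are hypothetical, and the Dataflow DoFn wrapper is left out so the snippet runs without the SDK.

```java
import java.io.Serializable;

// Plain-Java sketch of the parsing a LineToRecordFn might do (no Dataflow SDK needed).
class LineParsingSketch {

  // Same shape as the Record class in the answer.
  static class Record implements Serializable {
    final String user;
    final String role;
    final long duration;

    Record(String user, String role, long duration) {
      this.user = user;
      this.role = role;
      this.duration = duration;
    }
  }

  // Hypothetical helper: returns null for lines that cannot be parsed,
  // such as the header row or a row with a missing column.
  static Record parseLine(String line) {
    String[] fields = line.split(",");
    if (fields.length != 5) {
      return null;
    }
    try {
      long duration = Long.parseLong(fields[3].trim());
      return new Record(fields[2].trim(), fields[4].trim(), duration);
    } catch (NumberFormatException e) {
      return null; // e.g. the header line, where time_span is not a number
    }
  }

  public static void main(String[] args) {
    Record r = parseLine("1, 2016-01-04T1:26:13, Hari, 8, admin");
    System.out.println(r.user + " " + r.role + " " + r.duration); // prints "Hari admin 8"
  }
}
```

Inside the DoFn, processElement would call parseLine(c.element()) and output the result only when it is not null.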

You can window here if you like. Whether or not you window, you can create PCollections keyed by role and keyed by user:

PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
    new SimpleFunction<Record,KV<String,Long>>() {
          @Override
          public KV<String,Long> apply(Record r) {
            return KV.of(r.role,r.duration);
          }
        }));

PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
    new SimpleFunction<Record,KV<String,Long>>() {
              @Override
              public KV<String,Long> apply(Record r) {
                return KV.of(r.user, r.duration);
              }
            }));

Now you can get the means and sums in just a few lines:

PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
    Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
    Mean.<String,Long>perKey()); 
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
    Sum.<String>longsPerKey());

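Note that Dataflow performs some optimizations before running the job. So although it looks like you are making two passes over the records PCollection, that may not actually be the case.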
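
To make the expected results concrete, here is a small local analogue of a per-key mean and a per-key sum, applied to the three sample rows from the question. It uses Java streams rather than Dataflow, so it only illustrates which values such aggregations should produce; the class, field, and method names are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Local, non-distributed illustration of per-key mean and sum.
class PerKeyAggregationSketch {

  // One parsed row: user, user_level (role) and time_span (duration).
  static class Row {
    final String user;
    final String role;
    final long duration;

    Row(String user, String role, long duration) {
      this.user = user;
      this.role = role;
      this.duration = duration;
    }
  }

  // The three data rows from the question.
  static final List<Row> SAMPLE = Arrays.asList(
      new Row("Hari", "admin", 8),
      new Row("Gita", "admin", 2),
      new Row("Gita", "user", 0));

  // Average duration per key; a TreeMap keeps the keys sorted for printing.
  static Map<String, Double> meanPerKey(List<Row> rows, Function<Row, String> key) {
    return rows.stream().collect(
        Collectors.groupingBy(key, TreeMap::new, Collectors.averagingLong(r -> r.duration)));
  }

  // Total duration per key.
  static Map<String, Long> sumPerKey(List<Row> rows, Function<Row, String> key) {
    return rows.stream().collect(
        Collectors.groupingBy(key, TreeMap::new, Collectors.summingLong(r -> r.duration)));
  }

  public static void main(String[] args) {
    System.out.println(meanPerKey(SAMPLE, r -> r.user)); // {Gita=1.0, Hari=8.0}
    System.out.println(meanPerKey(SAMPLE, r -> r.role)); // {admin=5.0, user=0.0}
    System.out.println(sumPerKey(SAMPLE, r -> r.user));  // {Gita=2, Hari=8}
  }
}
```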

Answer 1 (score: 1)

The Mean and Sum transforms look like a good fit for this use case. Basic usage looks like this:

 PCollection<KV<String, Double>> meanPerKey =
     input.apply(Mean.<String, Integer>perKey());

 PCollection<KV<String, Integer>> sumPerKey = input
     .apply(Sum.<String>integersPerKey());
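
Since the question asks for the sum and the mean together, one possible variation (not taken from either answer) is to keep a running sum and count per key and derive both results from that single accumulator. The sketch below shows only the accumulator logic in plain Java. In Dataflow it would presumably be wrapped in a custom Combine.CombineFn, which is omitted here, and the names SumAndMeanSketch, Stats, and aggregate are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of an accumulator that yields both sum and mean per key.
class SumAndMeanSketch {

  // Running sum and count for one key.
  static class Stats {
    long sum;
    long count;

    void add(long value) {
      sum += value;
      count++;
    }

    // Folds another partial result into this one, which is what a
    // distributed combiner needs when merging partial aggregates.
    Stats merge(Stats other) {
      sum += other.sum;
      count += other.count;
      return this;
    }

    double mean() {
      return count == 0 ? Double.NaN : (double) sum / count;
    }
  }

  // Groups values[i] under keys[i] and accumulates them in one pass.
  static Map<String, Stats> aggregate(String[] keys, long[] values) {
    if (keys.length != values.length) {
      throw new IllegalArgumentException("keys and values must have the same length");
    }
    Map<String, Stats> result = new TreeMap<>();
    for (int i = 0; i < keys.length; i++) {
      result.computeIfAbsent(keys[i], k -> new Stats()).add(values[i]);
    }
    return result;
  }

  public static void main(String[] args) {
    // user and time_span columns of the sample data in the question.
    Map<String, Stats> byUser =
        aggregate(new String[] {"Hari", "Gita", "Gita"}, new long[] {8, 2, 0});
    for (Map.Entry<String, Stats> e : byUser.entrySet()) {
      System.out.println(e.getKey() + " sum=" + e.getValue().sum + " mean=" + e.getValue().mean());
    }
  }
}
```

Calling aggregate once with user keys and once with user_level keys would cover all three values the question lists.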