我有以下类型的样本数据。
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
现在我需要找到average_time_span/user
,average_time_span/user_level
和total_time_span/user
。
我能够找到上面提到的每一个值但是无法一次找到所有这些值。由于我是DataFlow的新手,请建议我采用适当的方法。
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
@Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}
答案 0 :(得分:2)
一种方法是首先将行解析为一个包含每行记录的PCollection,并从该集合创建两个键值对的PCollection。假设你定义一条表示这样一行的记录:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
// need a constructor here
}
现在,创建一个LineToRecordFn,从输入行创建记录,以便您可以这样做:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
如果你愿意,你可以在这里窗口。无论您是否在窗口,您都可以创建逐个角色和按键的用户PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
现在,你可以用几行来获得手段和总和:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
请注意,数据流在运行作业之前会进行一些优化。所以,虽然看起来你在记录PCollection上做了两次传递,但这可能不是真的。
答案 1 :(得分:1)