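I don't know whether "normalize" is the right word for this; if you know the correct term, please edit my question.
I have unstructured data that I have already processed into the following form: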
id | group | count
-------------------
1 | A | 32
1 | B | 1213
1 | C | 12
2 | B | 12
2 | C | 1
3 | A | 32
3 | B | 1213
3 | C | 12
Can I process the data further so that the result looks like this?
id | A  | B    | C
-------------------
1  | 32 | 1213 | 12
3  | 32 | 1213 | 12
2  | 0  | 12   | 1
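To make the target concrete, the transformation can be sketched in plain Java, outside Cascading. This is only an illustration under one assumption: the full list of column names ("A", "B", "C") is known up front. The names `PivotSketch`, `Row`, and `pivot` are hypothetical and are not part of Cascading.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PivotSketch {
    /** One input row: (id, group, count). Hypothetical helper type. */
    static final class Row {
        final int id;
        final String group;
        final long count;

        Row(int id, String group, long count) {
            this.id = id;
            this.group = group;
            this.count = count;
        }
    }

    /**
     * Pivots (id, group, count) rows into one long[] per id, with one slot
     * per entry of `columns`. Slots with no matching input row stay 0.
     * Assumption: `columns` lists every group name that should become a column.
     */
    static Map<Integer, long[]> pivot(List<Row> rows, List<String> columns) {
        Map<Integer, long[]> out = new LinkedHashMap<>();
        for (Row r : rows) {
            // Every id gets a zero-filled row the first time it is seen.
            long[] counts = out.computeIfAbsent(r.id, k -> new long[columns.size()]);
            int col = columns.indexOf(r.group);
            if (col >= 0) {
                counts[col] += r.count; // sum duplicates, like the aggregator does
            }
            // Groups not listed in `columns` are ignored in this sketch.
        }
        return out;
    }
}
```

With the sample data above and columns A, B, C, id 2 ends up with 0 in the A slot because it has no A row.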
Edit:
I can get almost what I want with something like the following:
Pipe conclusionPipe = new Pipe("conclusionPipe",countPipe);
conclusionPipe = new GroupBy(conclusionPipe, new Fields("id"), new Fields("group"));
conclusionPipe = new Every(conclusionPipe, new Fields("group", "count"),new CustomAggregator(), Fields.RESULTS);
The CustomAggregator class:
public static class Context implements Serializable {
    private static final long serialVersionUID = 7038915614929335060L;
    Map<String, Long> counter = new HashMap<String, Long>();

    void add(String key, Long value) {
        Long val = counter.get(key);
        if (val == null) val = 0L;
        counter.put(key, value + val);
    }

    Long get(String key) {
        return counter.get(key);
    }
}

@Override
public void start(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    aggregatorCall.setContext(new Context());
}

@Override
public void aggregate(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    aggregatorCall.getContext().add(
        aggregatorCall.getArguments().getString(0),
        aggregatorCall.getArguments().getLong(1));
}

@Override
public void complete(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    Set<String> keySet = aggregatorCall.getContext().counter.keySet();
    Fields field = new Fields();
    field = field.append(aggregatorCall.getGroup().getFields());
    Tuple result = new Tuple(aggregatorCall.getGroup().getTuple());
    field = field.append(new Fields(keySet.toArray(new Comparable[keySet.size()])));
    aggregatorCall.getOutputCollector().setFields(field);
    for (String key : keySet) {
        result.add(aggregatorCall.getContext().get(key));
    }
    aggregatorCall.getOutputCollector().add(result);
}
The output tap is defined as follows:
Tap outTap = new Hfs(new TextDelimited(true, "\t"), "out");
This is where the header is declared.
The result is as follows:
---empty line----
1 | 32 | 1213 |12
3 | 32 | 1213 |12
2 | 12 | 1
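The problem is that a row does not know what its headers are, and each row may have different headers. Is there a way to let every row know the header of each of its columns? It would also be acceptable for every row that lacks a particular header to get a zero value in that column. Ideally, though, rows with the same headers would be grouped together, for example: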
id | A  | B    | C
1  | 32 | 1213 | 12
3  | 32 | 1213 | 12
id | B  | C
2  | 12 | 1
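The grouping I have in mind can be sketched in plain Java, again outside Cascading. This only illustrates the intended output shape and is not a Cascading solution. The names `HeaderGroupingSketch` and `groupByHeader` are hypothetical, and the sketch assumes the per-id counts are already collected in a sorted map, like the counter in the aggregator above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HeaderGroupingSketch {
    /**
     * Groups ids by the exact set of group names they contain.
     * Input:  id -> (group name -> count), group names sorted by the TreeMap.
     * Output: header (list of group names) -> (id -> counts in header order).
     */
    static Map<List<String>, Map<Integer, List<Long>>> groupByHeader(
            Map<Integer, TreeMap<String, Long>> countsById) {
        Map<List<String>, Map<Integer, List<Long>>> blocks = new LinkedHashMap<>();
        for (Map.Entry<Integer, TreeMap<String, Long>> e : countsById.entrySet()) {
            // The sorted key set of this id acts as its header.
            List<String> header = new ArrayList<>(e.getValue().keySet());
            // Values come back in the same sorted key order as the header.
            List<Long> values = new ArrayList<>(e.getValue().values());
            blocks.computeIfAbsent(header, h -> new LinkedHashMap<>())
                  .put(e.getKey(), values);
        }
        return blocks;
    }
}
```

Each key of the returned map would become one header line, followed by the rows stored under it, which matches the two blocks shown above.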
Thanks!