Question

I'm not sure whether "normalize" is the right word here; if you know the correct term, please edit my question.

I have unstructured data that I have already processed into the following:

id | group | count
-------------------
1  |   A   | 32
1  |   B   | 1213
1  |   C   | 12
2  |   B   | 12
2  |   C   | 1
3  |   A   | 32
3  |   B   | 1213
3  |   C   | 12

Can I process the data further to get a result like this?

id | A  |  B   | C
-------------------
1  | 32 | 1213 |12
3  | 32 | 1213 |12
2  | 0  |  12  | 1
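
To make the target shape concrete, here is a minimal plain-Java sketch of the transformation I am after. It does not use Cascading at all, and the names PivotSketch, Row, and pivot are made up for illustration. It assumes the full list of groups (A, B, C) is known up front, which is what lets a missing group default to 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PivotSketch {

    /** One (id, group, count) input row; a hypothetical holder type. */
    static final class Row {
        final int id;
        final String group;
        final long count;

        Row(int id, String group, long count) {
            this.id = id;
            this.group = group;
            this.count = count;
        }
    }

    /**
     * Pivots long-format rows into one array per id, with one slot per
     * entry of groups. Slots for groups an id never had stay at 0.
     */
    static Map<Integer, long[]> pivot(List<Row> rows, List<String> groups) {
        Map<Integer, long[]> wide = new LinkedHashMap<>();
        for (Row row : rows) {
            int column = groups.indexOf(row.group);
            if (column < 0) {
                continue; // group is not in the declared header, so it is skipped
            }
            long[] counts = wide.computeIfAbsent(row.id, k -> new long[groups.size()]);
            counts[column] += row.count;
        }
        return wide;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, "A", 32), new Row(1, "B", 1213), new Row(1, "C", 12),
                new Row(2, "B", 12), new Row(2, "C", 1),
                new Row(3, "A", 32), new Row(3, "B", 1213), new Row(3, "C", 12));
        List<String> groups = Arrays.asList("A", "B", "C");

        // Print a header line followed by one tab-separated line per id.
        System.out.println("id\t" + String.join("\t", groups));
        for (Map.Entry<Integer, long[]> e : pivot(rows, groups).entrySet()) {
            StringBuilder line = new StringBuilder(String.valueOf(e.getKey()));
            for (long c : e.getValue()) {
                line.append('\t').append(c);
            }
            System.out.println(line);
        }
    }
}
```

With the sample data above, id 2 has no A row, so its A slot stays 0.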

Edit:

With something like the following I can get almost the data I want:

Pipe conclusionPipe = new Pipe("conclusionPipe", countPipe);
conclusionPipe = new GroupBy(conclusionPipe, new Fields("id"), new Fields("group"));
conclusionPipe = new Every(conclusionPipe, new Fields("group", "count"), new CustomAggregator(), Fields.RESULTS);

The CustomAggregator class:

public static class Context implements Serializable{
    private static final long serialVersionUID = 7038915614929335060L;
    Map<String, Long> counter = new HashMap<String, Long>();
    void add(String key, Long value) {
        Long val = counter.get(key);
        if (val == null) val = 0L;
        counter.put(key, value+val);
    }
    Long get(String key) {
        return counter.get(key);
    }
}
@Override
public void start(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    aggregatorCall.setContext(new Context());
}

@Override
public void aggregate(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    aggregatorCall.getContext().add(aggregatorCall.getArguments().getString(0), aggregatorCall.getArguments().getLong(1));

}

@Override
public void complete(FlowProcess flowProcess,
        AggregatorCall<Context> aggregatorCall) {
    Set<String> keySet = aggregatorCall.getContext().counter.keySet();
    Fields field = new Fields();
    field = field.append(aggregatorCall.getGroup().getFields());
    Tuple result = new Tuple(aggregatorCall.getGroup().getTuple());

    field = field.append(new Fields(keySet.toArray(new Comparable[keySet.size()])));
    aggregatorCall.getOutputCollector().setFields(field);
    for (String key:keySet) {
        result.add(aggregatorCall.getContext().get(key));
    }

    aggregatorCall.getOutputCollector().add(result);
}

And the output tap is defined as follows:

Tap outTap = new Hfs(new TextDelimited(true, "\t"), "out");

which is where the header is declared.

The result looks like this:

---empty line----
1  | 32 | 1213 |12
3  | 32 | 1213 |12
2  | 12 |   1

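The problem is that each row does not know what its header is, and different rows may have different headers. Is there a way to make each row aware of the header for each of its columns? Alternatively, rows that lack a particular header could get a zero value for it. Ideally, though, rows that share the same header would be grouped together, for example: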

id | A  |   B  | C
1  | 32 | 1213 |12
3  | 32 | 1213 |12
id | B  |   C
2  | 12 |   1
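
The grouped layout above can be sketched in plain Java as follows. Again this is only an illustration of the output I want, not Cascading code; HeaderBuckets, counts, and bucketByHeader are hypothetical names. It assumes the per-id counters (like the map kept in my Context class) are already available in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class HeaderBuckets {

    /** Builds a sorted group -> count map from alternating key/value arguments. */
    static Map<String, Long> counts(Object... keyValues) {
        Map<String, Long> m = new TreeMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            m.put((String) keyValues[i], (Long) keyValues[i + 1]);
        }
        return m;
    }

    /**
     * Buckets ids by the sorted list of groups they contain, so that every
     * id in a bucket can be printed under the same header line.
     */
    static Map<List<String>, List<Integer>> bucketByHeader(
            Map<Integer, Map<String, Long>> countersById) {
        Map<List<String>, List<Integer>> buckets = new LinkedHashMap<>();
        for (Map.Entry<Integer, Map<String, Long>> e : countersById.entrySet()) {
            List<String> header = new ArrayList<>(new TreeSet<>(e.getValue().keySet()));
            buckets.computeIfAbsent(header, h -> new ArrayList<>()).add(e.getKey());
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Long>> counters = new TreeMap<>();
        counters.put(1, counts("A", 32L, "B", 1213L, "C", 12L));
        counters.put(2, counts("B", 12L, "C", 1L));
        counters.put(3, counts("A", 32L, "B", 1213L, "C", 12L));

        // One header line per bucket, followed by the rows that share it.
        for (Map.Entry<List<String>, List<Integer>> bucket : bucketByHeader(counters).entrySet()) {
            System.out.println("id\t" + String.join("\t", bucket.getKey()));
            for (Integer id : bucket.getValue()) {
                StringBuilder line = new StringBuilder(String.valueOf(id));
                for (String group : bucket.getKey()) {
                    line.append('\t').append(counters.get(id).get(group));
                }
                System.out.println(line);
            }
        }
    }
}
```

Here ids 1 and 3 end up in the A, B, C bucket and id 2 in the B, C bucket. What I do not know is how to get the same effect inside a Cascading flow.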

Thanks!

Hadoop / Cascading: how to flatten the result?

0 answers: