Initially I have a structure like this:
+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  | 1  | 110 |
| 10:35 | send  | c  | d  | 1  | 120 |
| 10:31 | reply | e  | f  | 3  | 221 |
| 10:33 | reply | a  | c  | 1  | 210 |
| 10:34 | send  | a  | a  | 3  | 113 |
| 10:32 | reply | c  | d  | 3  | 157 |
+-------+-------+----+----+----+-----+
I want to normalize the table so that every row takes s1 and s2 from the earliest send entry with the same id:
+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  | 1  | 110 |
| 10:35 | send  | a  | b  | 1  | 120 |
| 10:33 | reply | a  | b  | 1  | 210 |
| 10:31 | reply | a  | a  | 3  | 221 |
| 10:34 | send  | a  | a  | 3  | 113 |
| 10:32 | reply | a  | a  | 3  | 157 |
+-------+-------+----+----+----+-----+
This is how I tried to solve the problem:
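events_groupby_id = GROUP events BY id;
events_normalized = FOREACH events_groupby_id {
    -- keep only the 'send' entries of this id
    f_reqs = FILTER events BY type matches 'send';
    -- order the filtered entries (not all events) so the oldest 'send' comes first
    o_reqs = ORDER f_reqs BY time ASC;
    req = LIMIT o_reqs 1;
    GENERATE req, events;
};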
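This is where I am stuck: events_normalized turns out to be a complex structure with nested bags, and I don't know how to flatten it correctly.
events_normalized | req: bag{:tuple()} | events: bag{:tuple()}
What should I do from here to get the data structure I want? I would be very grateful for any help. Thanks.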
Answer 0: (score: 1)
You can FLATTEN the bags in events_normalized:
events_flattened = FOREACH events_normalized GENERATE
    FLATTEN(req),
    FLATTEN(events);
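This creates a cross product between req and events, but since req contains only one tuple, you end up with exactly one record per original entry. The schema of events_flattened is: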
req::time | req::type | req::s1 | req::s2 | req::id | req::p1 | events::time | events::type | events::s1 | events::s2 | events::id | events::p1
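To make the cross-product behaviour concrete, here is a minimal Python sketch that models what flattening the two bags does for a single group. It is only an illustration of the semantics, not Pig code; the helper name `flatten_bags` is hypothetical, and the tuples are the id = 1 rows from the question. It also shows one edge case worth keeping in mind: flattening an empty bag produces no rows, so a group without any send entry would drop out of the result.

```python
from itertools import product

def flatten_bags(req, events):
    """Model FLATTEN(req), FLATTEN(events) for one group as a cross product.

    Hypothetical helper: each output row is a req tuple followed by an
    events tuple, mirroring the req::* and events::* columns.
    """
    return [r + e for r, e in product(req, events)]

# One group (id = 1): req holds the single earliest 'send' tuple.
req = [("10:30", "send", "a", "b", 1, 110)]
events = [
    ("10:30", "send", "a", "b", 1, 110),
    ("10:35", "send", "c", "d", 1, 120),
    ("10:33", "reply", "a", "c", 1, 210),
]

rows = flatten_bags(req, events)
print(len(rows))     # 3: one row per original event
print(len(rows[0]))  # 12: six req fields followed by six events fields

# An empty req bag (a group with no 'send' entry) yields no rows at all.
print(flatten_bags([], events))  # []
```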
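Now you can reference the fields you want to keep, using events for the original entry and req for the replacement values taken from the oldest send-type entry: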
final = FOREACH events_flattened GENERATE
    events::time AS time,
    events::type AS type,
    req::s1 AS s1,
    req::s2 AS s2,
    events::id AS id,
    events::p1 AS p1;
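As a sanity check of the whole approach, the following Python sketch simulates the group, filter, order, limit, flatten and project steps on the sample data from the question. It is a rough model under stated assumptions, not a Pig implementation: the function name `normalize` is hypothetical, and times are compared as zero-padded HH:MM strings.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (time, type, s1, s2, id, p1)
events = [
    ("10:30", "send", "a", "b", 1, 110),
    ("10:35", "send", "c", "d", 1, 120),
    ("10:31", "reply", "e", "f", 3, 221),
    ("10:33", "reply", "a", "c", 1, 210),
    ("10:34", "send", "a", "a", 3, 113),
    ("10:32", "reply", "c", "d", 3, 157),
]

def normalize(rows):
    """Replace s1/s2 of every row with those of the earliest 'send' per id."""
    out = []
    by_id = itemgetter(4)
    # GROUP events BY id (sorted() is stable, so input order is kept per group)
    for _, group in groupby(sorted(rows, key=by_id), key=by_id):
        bag = list(group)
        # FILTER by type, ORDER by time, LIMIT 1
        sends = sorted((r for r in bag if r[1] == "send"), key=itemgetter(0))
        req = sends[:1]
        # FLATTEN(req), FLATTEN(events), then project the wanted fields
        for r in req:
            for e in bag:
                out.append((e[0], e[1], r[2], r[3], e[4], e[5]))
    return out

normalized = normalize(events)
for row in normalized:
    print(row)
```

On the sample data this reproduces the rows of the desired table from the question; a group with no send entry would produce no rows, matching the empty-bag behaviour of FLATTEN.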