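I have two tables in Amazon DynamoDB: Elements and Containers. The hierarchy is that one container can hold a few elements.
So an item in Elements looks like: uuid, timestamp, container_id, data
I want to aggregate the data of all elements into the corresponding containers, for example:
Elements: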
| uuid | container_id | data |
| 1 | 1 | 100 |
| 2 | 1 | 150 |
| 3 | 2 | 100 |
So I want to end up with this in the Containers table:
| uuid | data |
| 1 | 250 |
| 2 | 100 |
So, using Hive, I wrote this script (started from an EMR cluster):
CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data");
INSERT INTO TABLE container SELECT container_id as `uuid`, sum(`data`) as `data` FROM element WHERE container_id IS NOT NULL GROUP BY container_id;
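It works fine, but now I need to write some additional data to the Containers table, so an item should look like uuid, data, another_data. However, when I run the script above, it overwrites every another_data value (that attribute is not listed in the external table). I have tried many variations but could not find a solution.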
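The behaviour described above is consistent with each Hive row being written as a complete item, so any attribute that is not mapped in the external table disappears when the item is replaced. Below is a minimal Python sketch of that effect, assuming whole-item replacement. The `put_item` helper, the in-memory `containers` dict, and the value 7.0 are hypothetical stand-ins for illustration, not the DynamoDB API or real data.

```python
# Hypothetical in-memory stand-in for the Containers table, keyed by uuid.
containers = {
    "1": {"uuid": "1", "data": 0.0, "another_data": 7.0},
}


def put_item(table, item):
    """Replace the whole item stored under item['uuid'] (assumed write semantics)."""
    table[item["uuid"]] = dict(item)


# The Hive row only carries the mapped columns: uuid and data.
put_item(containers, {"uuid": "1", "data": 250.0})

# another_data is gone because the whole item was replaced.
print(containers["1"])
```

Under this assumption, the only way to keep another_data is to include its current value in the row that gets written.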
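Answer 0 (score: 0)
OK, I found the answer. Map another_data in the external table as well, and carry its existing value through a LEFT JOIN on the container table so that it is written back unchanged: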
CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double, `another_data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data,another_data:another_data");
INSERT INTO TABLE container SELECT element.`container_id` as `uuid`, sum(element.`data`) as `data`, collect_set(container.`another_data`)[0] as `another_data` FROM element LEFT JOIN container ON (element.`container_id` = container.`uuid`) WHERE element.container_id IS NOT NULL GROUP BY element.container_id;
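To make the logic of this query easier to follow, here is a rough Python equivalent of the aggregation and the LEFT JOIN. It is only a sketch on in-memory data: the function name `aggregate`, the sample rows, and the value 7.0 for another_data are assumptions for illustration, and ids are kept as strings for simplicity.

```python
from collections import defaultdict


def aggregate(elements, containers):
    """Sum data per container_id and carry over the existing another_data."""
    totals = defaultdict(float)
    for row in elements:
        if row["container_id"] is not None:  # WHERE container_id IS NOT NULL
            totals[row["container_id"]] += row["data"]  # sum(data) per group

    result = {}
    for container_id, total in totals.items():
        # LEFT JOIN: the container may not exist yet, so fall back to {}.
        existing = containers.get(container_id, {})
        result[container_id] = {
            "uuid": container_id,
            "data": total,
            "another_data": existing.get("another_data"),  # None if no match
        }
    return result


elements = [
    {"uuid": "1", "container_id": "1", "data": 100.0},
    {"uuid": "2", "container_id": "1", "data": 150.0},
    {"uuid": "3", "container_id": "2", "data": 100.0},
]
containers = {"1": {"uuid": "1", "data": 0.0, "another_data": 7.0}}

new_items = aggregate(elements, containers)
print(new_items)
```

Container 1 keeps its another_data, while container 2, which has no existing item in this sample, ends up with no value for it. That mirrors the unmatched side of the LEFT JOIN.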