How to copy data to another table without overwriting existing columns

Date: 2015-11-05 10:34:41

Tags: hadoop hive amazon-dynamodb

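I have 2 tables in Amazon DynamoDB: Elements and Containers. The hierarchy is that one container can hold several elements, so an item in Elements looks like: uuid, timestamp, container_id, data. I want to aggregate the data of all elements into their corresponding containers. For example:
Elements: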

| uuid | container_id | data |  
| 1    | 1            | 100  |  
| 2    | 1            | 150  |  
| 3    | 2            | 100  |  

So I want to end up with this in the Containers table:

| uuid | data |  
| 1    | 250  |  
| 2    | 100  |  

So, using Hive, I wrote this script (run from an EMR cluster):

CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data");
INSERT INTO TABLE container SELECT container_id as `uuid`, sum(`data`) as `data` FROM element WHERE container_id IS NOT NULL GROUP BY container_id;

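It works fine, but now I need to write some additional data to the Containers table, so an item should look like uuid, data, another_data. When I run the script above, however, it overwrites another_data (which is not listed in the external table) on every item it writes. I have tried many variations but could not find a solution.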
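
The behaviour described above is what you would see if each row written through the external table replaces the whole DynamoDB item with the same key. The following is only a minimal Python model of that assumption, not the connector's code; `put_item` and the sample value `7.0` are hypothetical and exist purely for illustration.

```python
def put_item(table, item):
    """Put-style write: replace the whole item stored under item['uuid']."""
    table[item["uuid"]] = dict(item)


# Hypothetical Containers content; another_data was written by some other process.
containers = {"1": {"uuid": "1", "data": 0.0, "another_data": 7.0}}

# The external table maps only uuid and data, so the row it writes
# carries no another_data attribute at all.
put_item(containers, {"uuid": "1", "data": 250.0})

print(containers["1"])  # {'uuid': '1', 'data': 250.0}
```

Under this model the attribute is not set to null, it simply disappears, because the new item never contained it.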

1 Answer:

Answer 0 (score: 0)

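OK, I found the answer: map another_data in the external table as well, and LEFT JOIN element with container so that the current another_data value is read and written back together with the new sum: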

CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double, `another_data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data,another_data:another_data");
INSERT INTO TABLE container SELECT element.`container_id` as `uuid`, sum(element.`data`) as `data`, collect_set(container.`another_data`)[0] as `another_data` FROM element LEFT JOIN container ON (element.`container_id` = container.`uuid`) WHERE element.container_id IS NOT NULL GROUP BY element.container_id;
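
To make the logic of the query easier to follow, here is a rough Python sketch of what the LEFT JOIN plus GROUP BY computes, using the sample rows from the question. It is a model under stated assumptions rather than the Hive execution: `aggregate_with_carry_over` is a hypothetical helper, the `another_data` value `7.0` is invented for illustration, and a container without `another_data` is assumed to produce a row without that attribute.

```python
from collections import defaultdict


def aggregate_with_carry_over(elements, containers):
    """Sum data per container_id and carry over the container's another_data."""
    existing = {c["uuid"]: c for c in containers}
    totals = defaultdict(float)
    for e in elements:
        if e["container_id"] is not None:  # WHERE container_id IS NOT NULL
            totals[str(e["container_id"])] += e["data"]  # GROUP BY + sum(data)

    rows = []
    for uuid, total in totals.items():
        row = {"uuid": uuid, "data": total}
        # LEFT JOIN: the container may be missing or may lack another_data.
        another = existing.get(uuid, {}).get("another_data")
        if another is not None:
            row["another_data"] = another
        rows.append(row)
    return rows


elements = [
    {"uuid": "1", "container_id": 1, "data": 100.0},
    {"uuid": "2", "container_id": 1, "data": 150.0},
    {"uuid": "3", "container_id": 2, "data": 100.0},
]
# Hypothetical current state of Containers; 7.0 is an invented value.
containers = [
    {"uuid": "1", "data": 0.0, "another_data": 7.0},
    {"uuid": "2", "data": 0.0},
]

rows = aggregate_with_carry_over(elements, containers)
for row in rows:
    print(row)
```

Because every output row already contains the existing `another_data`, writing it back keeps the value even if the write replaces the whole item. Containers that have no elements produce no row and are therefore left untouched.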