Question

I have a bunch of gzip-compressed files in HDFS laid out as /home/myuser/salesdata/some_date/ALL/<country>.gz, for example /home/myuser/salesdata/20180925/ALL/us.gz

The data has the format

<country> \t count1,count2,count3

So essentially the first delimiter is a tab, and I then need to extract the comma-separated values into separate columns.

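I would like to create an external table partitioned by country, year, month, and day. The data is very large, potentially hundreds of TB, so I want the external table on its own rather than having to duplicate the data by importing it into a standard table.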

Is it possible to achieve this using only an external table?

Answer 1

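Assuming the country is separated from the rest of the line by a tab ('\t') and the remaining fields are separated by commas, here is what you can do.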

You can create a temporary table whose first column is a string and whose remaining values are held in an array.

create external table temp.test_csv (country string, count array<int>)
row format delimited fields terminated by "\t"
collection items terminated by ','
stored as textfile
location '/apps/temp/table';

Now, if you drop your files into the /apps/temp/table location, you should be able to select the data as shown below.

select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv

To create the partitions, create another table as shown below.

drop table if exists temp.test_csv_orc;
create table temp.test_csv_orc (count_1 int, count_2 int, count_3 int)
partitioned by (year string, month string, day string, country string)
stored as orc;

Then load the data from the temporary table into this table.
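
A minimal sketch of the load step, assuming the files currently in the temporary table are those for 25 September 2018. The date values are illustrative, taken from the example path in the question, and would change for each batch loaded.

```sql
-- Dynamic partitioning is on by default in recent Hive versions;
-- it is set explicitly here in case the cluster overrides it.
SET hive.exec.dynamic.partition=true;

-- year, month and day are given as constants (static partitions).
-- country has no value, so Hive takes it from the last column of
-- the SELECT (dynamic partition).
INSERT INTO TABLE temp.test_csv_orc
PARTITION (year='2018', month='09', day='25', country)
SELECT count[0] AS count_1,
       count[1] AS count_2,
       count[2] AS count_3,
       country
FROM temp.test_csv;
```

The dynamic partition column has to come last in both the PARTITION clause and the SELECT list, which matches the column order in the partitioned by clause of temp.test_csv_orc.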

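I made country a dynamic partition because its value comes from the file; the other partition columns (year, month, and day) do not, so they are static.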
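
Loading into an ORC table writes a second copy of the data, which may not suit the hundreds of TB mentioned in the question. One possible variation, offered as a sketch and not as part of the approach above, is to keep an external table partitioned only by date and register each existing date directory as a partition, leaving country as an ordinary column. The table name temp.sales_ext is hypothetical; the path follows the layout given in the question.

```sql
-- Hypothetical table name; the columns mirror temp.test_csv.
CREATE EXTERNAL TABLE temp.sales_ext (country STRING, count ARRAY<INT>)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE;

-- Point one partition at a date directory that already exists in HDFS.
-- No data is moved or copied; only metadata is added.
ALTER TABLE temp.sales_ext ADD IF NOT EXISTS
PARTITION (year='2018', month='09', day='25')
LOCATION '/home/myuser/salesdata/20180925/ALL';

-- The date predicates prune partitions; country is filtered row by row.
SELECT country, count[0] AS count_1, count[1] AS count_2, count[2] AS count_3
FROM temp.sales_ext
WHERE year='2018' AND month='09' AND day='25' AND country='us';
```

A Hive partition maps to a directory, not to a single file, so with this layout the country filter still reads every file in the date directory. Partitioning by country as well would need one directory per country.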

Hive - Partitioning an external table by data content
