I have a large dataset, with 1000 columns, stored on HDFS. I want to create a Hive table over it to filter and process the data.
CREATE EXTERNAL TABLE IF NOT EXISTS tablename(
var1 INT, var2 STRING, var3 STRING)
COMMENT 'testbykasa'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/folder1/';
For a small number of columns (~5-10) I specify the column names and types manually. Is there a way to have Hive create the table by inferring the column names and data types, without spelling them out by hand?
Answer 0 (score: 1)
mydata.csv
2,2,8,1,5,1,8,1,4,1,3,4,9,2,8,2,6,5,3,1,5,5,8,0,1,6,0,7,1,4
2,6,8,7,7,9,9,3,8,7,3,1,9,1,7,5,9,7,1,2,5,7,0,5,1,2,6,4,0,4
0,0,1,3,6,5,6,2,4,2,4,9,0,4,9,8,1,0,2,8,4,7,8,3,9,7,8,9,5,5
3,4,9,1,8,7,4,2,1,0,4,3,1,4,6,6,7,4,9,9,6,7,9,5,2,2,8,0,2,9
3,4,8,9,9,1,5,2,7,4,7,1,4,9,8,9,3,3,2,3,3,5,4,8,6,5,8,8,6,4
4,0,6,9,3,2,4,2,9,4,6,8,8,2,6,7,1,7,3,1,6,6,5,2,9,9,4,6,9,7
7,0,9,3,7,6,5,5,7,2,4,2,7,4,6,1,0,9,8,2,5,7,1,4,0,4,3,9,4,3
2,8,3,7,7,3,3,6,9,3,5,5,0,7,5,3,6,2,9,0,8,2,3,0,6,2,4,3,2,6
3,2,0,8,8,8,1,8,4,0,5,2,5,0,2,0,4,1,2,2,1,0,2,8,6,7,2,2,7,0
0,5,9,1,0,3,1,9,3,6,2,1,5,0,6,6,3,8,2,8,0,0,1,9,1,5,5,2,4,8
-- Define the table with a single STRING column. With the serde property
-- below, the whole record, delimiters included, lands in that one column;
-- add a LOCATION clause (e.g. LOCATION '/folder1/') to point it at the data.
create external table mycsv (rec string)
row format delimited
stored as textfile
tblproperties ('serialization.last.column.takes.rest'='true')
;
-- Profile the data: split each record on ',' and count the distinct
-- values occurring at every column position.
select pe.pos + 1 as col        -- pos is 0-based, hence the +1
      ,count(distinct pe.val) as count_distinct_val
from mycsv
lateral view posexplode(split(rec,',')) pe
group by pe.pos
;
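Against the sample data this returns: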
+------+---------------------+
| col | count_distinct_val |
+------+---------------------+
| 1 | 5 |
| 2 | 6 |
| 3 | 6 |
| 4 | 5 |
| 5 | 7 |
| 6 | 8 |
| 7 | 7 |
| 8 | 7 |
| 9 | 6 |
| 10 | 7 |
| 11 | 6 |
| 12 | 7 |
| 13 | 7 |
| 14 | 6 |
| 15 | 6 |
| 16 | 9 |
| 17 | 7 |
| 18 | 9 |
| 19 | 5 |
| 20 | 6 |
| 21 | 7 |
| 22 | 5 |
| 23 | 8 |
| 24 | 7 |
| 25 | 5 |
| 26 | 6 |
| 27 | 7 |
| 28 | 8 |
| 29 | 8 |
| 30 | 8 |
+------+---------------------+
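From here, individual columns can be referenced on demand, e.g. split(rec,',')[0] for column 1, or wrapped in a view that gives each position a proper name and cast.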
Answer 1 (score: 0)
Yes, it is possible, though not with a SQL script alone. For this I use a Python script that reads the first line of the CSV file and, via the pyhive library, builds a CREATE TABLE statement that is sent to Hive dynamically (and strips the header line from the CSV). To identify the types, plain Python functions are enough to check whether each value is a string, a number, and so on. The problem in my case was that this only worked with Python 2.7, so I suggest you consider writing the same code in Scala.
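As a rough illustration of that approach, here is a minimal Python sketch (not the answerer's actual script). It assumes the CSV's first line is a header carrying the column names, infers each column's Hive type from the first data row, and builds the DDL; the table name, file path, HDFS location, and connection details are placeholders. Rather than erasing the header line from the file, this sketch uses Hive's skip.header.line.count table property to skip it at read time.

import csv

def infer_hive_type(value):
    # Crude type detection: try INT, then DOUBLE, fall back to STRING.
    try:
        int(value)
        return 'INT'
    except ValueError:
        pass
    try:
        float(value)
        return 'DOUBLE'
    except ValueError:
        return 'STRING'

def build_ddl(csv_path, table_name, hdfs_location):
    with open(csv_path) as f:
        reader = csv.reader(f)
        header = next(reader)   # column names from the first line
        sample = next(reader)   # first data row, used for type inference
    cols = ',\n  '.join('`%s` %s' % (name, infer_hive_type(val))
                        for name, val in zip(header, sample))
    return ("CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
            "STORED AS TEXTFILE\n"
            "LOCATION '%s'\n"
            "TBLPROPERTIES ('skip.header.line.count'='1')"
            % (table_name, cols, hdfs_location))

ddl = build_ddl('mydata.csv', 'tablename', '/folder1/')
print(ddl)

# To submit the generated statement with pyhive (host/port are placeholders):
# from pyhive import hive
# hive.connect(host='hive-server', port=10000).cursor().execute(ddl)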