Can we load a bucketed Hive table with the LOAD DATA LOCAL INPATH ... command? I ran it for a sample file, but the data values were inserted as NULL.
hduser@ubuntu:~$ cat /home/hduser/Desktop/hive_external/hive_external/emp2.csv
101,EName1,110.1
102,EName2,120.1
103,EName3,130.1
hive (default)> load data local inpath '/home/hduser/Desktop/hive_external/hive_external' overwrite into table emp_bucket;
Loading data to table default.emp_bucket
Table default.emp_bucket stats: [numFiles=1, numRows=0, totalSize=51, rawDataSize=0]
OK
Time taken: 1.437 seconds
hive (default)> select * from emp_bucket;
OK
emp_bucket.emp_id emp_bucket.emp_name emp_bucket.emp_salary
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.354 seconds, Fetched: 3 row(s)
hive (default)> show create table emp_bucket;
OK
createtab_stmt
CREATE TABLE `emp_bucket`(
`emp_id` int,
`emp_name` string,
`emp_salary` float)
CLUSTERED BY (
emp_id)
INTO 3 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:54310/user/hive/warehouse/emp_bucket'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='51',
'transient_lastDdlTime'='1457967994')
Time taken: 0.801 seconds, Fetched: 22 row(s)
But when the same data was inserted with an INSERT statement, it went in successfully.
hive (default)> select * from koushik.emp2;
OK
emp2.id emp2.name emp2.salary
101 EName1 110.1
102 EName2 120.1
103 EName3 130.1
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive (default)> insert overwrite table emp_bucket select * from koushik.emp2;
Query ID = hduser_20160314080808_ae88f1c8-3db6-4a5c-99d2-e9a5312c597d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457951378402_0002, Tracking URL = http://localhost:8088/proxy/application_1457951378402_0002/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1457951378402_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2016-03-14 08:09:33,203 Stage-1 map = 0%, reduce = 0%
2016-03-14 08:09:48,243 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.24 sec
2016-03-14 08:09:59,130 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 6.39 sec
2016-03-14 08:10:02,382 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 8.8 sec
2016-03-14 08:10:03,442 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.03 sec
MapReduce Total cumulative CPU time: 11 seconds 30 msec
Ended Job = job_1457951378402_0002
Loading data to table default.emp_bucket
Table default.emp_bucket stats: [numFiles=3, numRows=3, totalSize=51, rawDataSize=48]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 11.03 sec HDFS Read: 12596 HDFS Write: 273 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 30 msec
OK
emp2.id emp2.name emp2.salary
Time taken: 103.027 seconds
hive (default)> select * from emp_bucket;
OK
emp_bucket.emp_id emp_bucket.emp_name emp_bucket.emp_salary
102 EName2 120.1
103 EName3 130.1
101 EName1 110.1
Time taken: 0.08 seconds, Fetched: 3 row(s)
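The row order of this result follows from the bucket layout. Below is a small Python sketch (an illustration, not Hive code) of Hive's default bucket assignment for an INT clustering key: the hash of an int is the value itself, masked to non-negative and taken modulo the bucket count. With 3 buckets, 102 lands in bucket file 000000_0, 103 in 000001_0, and 101 in 000002_0, which matches the order in which SELECT * returns the rows:

```python
# Sketch of Hive's default bucket assignment for an INT clustering column.
# Hive hashes an int to itself (like java.lang.Integer.hashCode), masks the
# hash to a non-negative value, and takes it modulo the number of buckets.

INT_MAX = 0x7FFFFFFF  # Java Integer.MAX_VALUE


def bucket_for(emp_id: int, num_buckets: int = 3) -> int:
    # For an int column the hash is simply the value itself
    return (emp_id & INT_MAX) % num_buckets


for emp_id in (101, 102, 103):
    print(emp_id, "-> bucket", bucket_for(emp_id))
# 101 -> bucket 2, 102 -> bucket 0, 103 -> bucket 1
```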
The question is: can a bucketed Hive table be loaded directly from a file?
Answer 0 (score: 0)
You may have to enable bucketing before loading the file into a bucketed table. Set the bucketing property first, then load the file:
set hive.enforce.bucketing = true;
Comment here if it does not work.
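A minimal sketch of the sequence this answer describes, reusing the table and sample file from the question (whether a plain LOAD populates the buckets correctly on this Hive version is exactly what is in dispute here):

```sql
-- Enable bucketing enforcement before touching the bucketed table
set hive.enforce.bucketing = true;

-- Then retry the load of the sample file
load data local inpath '/home/hduser/Desktop/hive_external/hive_external/emp2.csv'
overwrite into table emp_bucket;
```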
Answer 1 (score: 0)
Apparently, Hive does not support bucketing data loaded from external files. So it seems you have to go the INSERT OVERWRITE TABLE ... route rather than LOAD DATA INPATH ..., cf. the Hadoop tutorial.
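A sketch of the INSERT OVERWRITE route this answer recommends: load the raw CSV into a plain, non-bucketed staging table (here a hypothetical `emp_stage`), then let Hive distribute the rows into buckets on insert, as the question's successful run with `koushik.emp2` already demonstrated:

```sql
-- Plain (non-bucketed) staging table matching the CSV layout
CREATE TABLE emp_stage (emp_id INT, emp_name STRING, emp_salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- LOAD is fine here: the staging table has no bucketing to enforce
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/hive_external/hive_external/emp2.csv'
OVERWRITE INTO TABLE emp_stage;

-- Hive runs a MapReduce job (one reducer per bucket) and writes
-- one file per bucket into the bucketed table's location
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE emp_bucket SELECT * FROM emp_stage;
```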