I created a Hive table from a CSV file:
CREATE TABLE RECORD_CSV(
completed_on string, distance_travelled double,
end_location_lat double, end_location_long double,
started_on string, driver_rating double,
rider_rating double, start_zip_code int,
end_zip_code int, charity_id int,
requested_car_category string, free_credit_used double,
surge_factor double, start_location_long double,
start_location_lat double, color string,
make string, model string, year int,
rating double, Date string, PRCP double,
TMAX double, TMIN double, AWND double,
GustSpeed2 double, Fog double, HeavyFog double,
Thunder double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
When I run
SELECT COUNT(*) FROM RECORD_CSV;
it returns
OK
911057
Time taken: 21.403 seconds, Fetched: 1 row(s)
When I create another table partitioned by the color field with the following statements, the row count drops.
CREATE TABLE RECORD_CSV_BYCOLOR(completed_on string, distance_travelled double,
end_location_lat double ,end_location_long double,
started_on string ,driver_rating double ,rider_rating double ,
start_zip_code int ,end_zip_code int ,charity_id int,
requested_car_category string,free_credit_used double,
surge_factor double,start_location_long double,start_location_lat double ,
make string ,model string ,year int ,rating double,Date string,PRCP double,
TMAX double,TMIN double,AWND double,GustSpeed2 double,
Fog double,HeavyFog double,Thunder double
)
PARTITIONED BY (color string)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
STORED AS TEXTFILE;
INSERT OVERWRITE table RECORD_CSV_BYCOLOR PARTITION(color)
select completed_on, distance_travelled,end_location_lat,
end_location_long, started_on, driver_rating, rider_rating,
start_zip_code, end_zip_code, charity_id, requested_car_category,
free_credit_used, surge_factor, start_location_long, start_location_lat,
make, model, year, rating, Date, PRCP, TMAX, TMIN, AWND, GustSpeed2,
Fog, HeavyFog, Thunder, color FROM RECORD_CSV;
When I run SELECT COUNT(*) FROM RECORD_CSV_BYCOLOR; I see that records have been dropped:
OK
693991
Time taken: 21.552 seconds, Fetched: 1 row(s)
Below is the output of a GROUP BY on color for the RECORD_CSV table:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 7.11 sec HDFS Read: 165793766 HDFS Write: 349 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 110 msec
OK
Silver 634
Black 204004
Bronze 214
Burgundy 1587
GREEN 195
Gold 6346
Gray 644
Maroon 847
Silver 170241
Silver 147
Tan 1066
Teal 913
White 152919
White 404
Yellow/Gold 20540
Blue 90
Brown 18594
Gray 134155
Navy Blue 48
Red 80352
WHITE 52
Yellow 448
Black 361
Blue 81999
Dark Blue 199
Dark Grey 18
Green 15396
Grey 12503
Magenta 324
Orange 5817
Time taken: 25.186 seconds, Fetched: 30 row(s)
And below is the same GROUP BY for RECORD_CSV_BYCOLOR:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.48 sec HDFS Read: 30648230 HDFS Write: 281 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
Silver 634
Black 361
Blue 90
Bronze 214
Brown 18594
Burgundy 1587
Dark Blue 199
Dark Grey 18
GREEN 195
Gold 6346
Gray 644
Green 15396
Grey 12503
Magenta 324
Maroon 847
Navy Blue 48
Orange 5817
Red 80352
Silver 147
Tan 1066
Teal 913
WHITE 52
White 404
Yellow 448
Yellow/Gold 20540
Time taken: 20.937 seconds, Fetched: 25 row(s)
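The GROUP BY on the source table lists some colors more than once, each with its own count, while the target table seems to keep only the rows with the lower counts for those colors. The discrepancy is clearly there, but why does it happen? What should I change in my code?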
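
One way to narrow this down might be to check whether the color values that look identical in the output really are identical strings. The query below is only a diagnostic sketch. It assumes (this is not confirmed by the output above) that the repeated names differ in something invisible, such as leading or trailing whitespace. The aliases bracketed_color, color_length and cnt are made up for illustration. It uses only the built-in concat and length functions.

```sql
-- Diagnostic sketch: wrap each distinct color in brackets and show its length,
-- so that hidden whitespace or other invisible differences become visible.
-- Assumption: the duplicated names in the GROUP BY output are distinct strings.
SELECT concat('[', t.color, ']') AS bracketed_color,
       length(t.color)           AS color_length,
       t.cnt
FROM (
  SELECT color, COUNT(*) AS cnt
  FROM RECORD_CSV
  GROUP BY color
) t
ORDER BY bracketed_color;
```

If two rows show the same name but different lengths, the values are distinct in the source table. It would then be worth checking how those values map to partition directories in RECORD_CSV_BYCOLOR.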