Question

I created a Hive table from a CSV file:

CREATE TABLE RECORD_CSV(
  completed_on string, distance_travelled double, 
  end_location_lat double, end_location_long double, 
  started_on string, driver_rating double, 
  rider_rating double, start_zip_code int, 
  end_zip_code int, charity_id int, 
  requested_car_category string, free_credit_used double, 
  surge_factor double, start_location_long double, 
  start_location_lat double, color string, 
  make string, model string, year int, 
  rating double, Date string, PRCP double, 
  TMAX double, TMIN double, AWND double, 
  GustSpeed2 double, Fog double, HeavyFog double, 
  Thunder double
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

When I run SELECT COUNT(*) FROM RECORD_CSV; it returns:

OK
911057
Time taken: 21.403 seconds, Fetched: 1 row(s)

When I create another table partitioned by the color field using the following statements, the row count drops.

CREATE TABLE RECORD_CSV_BYCOLOR(completed_on string, distance_travelled double,
end_location_lat double ,end_location_long double,
started_on string ,driver_rating double ,rider_rating double ,
start_zip_code int ,end_zip_code int ,charity_id int,
requested_car_category string,free_credit_used double,
surge_factor double,start_location_long double,start_location_lat double ,
make string ,model string ,year int ,rating double,Date string,PRCP double,
TMAX double,TMIN double,AWND double,GustSpeed2 double,
Fog double,HeavyFog double,Thunder double
)
PARTITIONED BY (color string)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' 
STORED AS TEXTFILE;

INSERT OVERWRITE table RECORD_CSV_BYCOLOR PARTITION(color) 
select completed_on, distance_travelled,end_location_lat, 
end_location_long, started_on, driver_rating, rider_rating,
start_zip_code, end_zip_code, charity_id, requested_car_category,
free_credit_used, surge_factor, start_location_long, start_location_lat,
make, model, year, rating, Date, PRCP, TMAX, TMIN, AWND, GustSpeed2,
Fog, HeavyFog, Thunder, color FROM RECORD_CSV;

When I run SELECT COUNT(*) FROM RECORD_CSV_BYCOLOR; I see that records have been dropped:

OK
693991
Time taken: 21.552 seconds, Fetched: 1 row(s)

Here is the output of a GROUP BY on color for the RECORD_CSV table, which shows the difference:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 7.11 sec   HDFS Read: 165793766 HDFS Write: 349 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 110 msec
OK
 Silver 634
Black   204004
Bronze  214
Burgundy    1587
GREEN   195
Gold    6346
Gray    644
Maroon  847
Silver  170241
Silver  147
Tan 1066
Teal    913
White   152919
White   404
Yellow/Gold 20540
Blue    90
Brown   18594
Gray    134155
Navy Blue   48
Red 80352
WHITE   52
Yellow  448
Black   361
Blue    81999
Dark Blue   199
Dark Grey   18
Green   15396
Grey    12503
Magenta 324
Orange  5817
Time taken: 25.186 seconds, Fetched: 30 row(s)

And the same for RECORD_CSV_BYCOLOR:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 30648230 HDFS Write: 281 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
 Silver 634
Black   361
Blue    90
Bronze  214
Brown   18594
Burgundy    1587
Dark Blue   199
Dark Grey   18
GREEN   195
Gold    6346
Gray    644
Green   15396
Grey    12503
Magenta 324
Maroon  847
Navy Blue   48
Orange  5817
Red 80352
Silver  147
Tan 1066
Teal    913
WHITE   52
White   404
Yellow  448
Yellow/Gold 20540
Time taken: 20.937 seconds, Fetched: 25 row(s)

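The GROUP BY on the source table lists some colors (Black, Blue, Gray, Silver, White) twice, each time with a different count, while the target table keeps only the row with the smaller count. The difference is visible, but why does this happen, and what should I change in my code?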
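
A possible first check, offered as a sketch and not as a confirmed diagnosis: because the source GROUP BY reports some colors twice, those values may differ in characters that the console does not show, such as trailing whitespace or a carriage return. That is an assumption; nothing in the output above proves it. The query below uses Hive's built-in length() and hex() functions to expose such differences. The aliases len, raw_bytes and n are made up for illustration.

```sql
-- Diagnostic sketch: show each distinct color value with its length and raw bytes,
-- so that two values that print identically can be told apart.
-- Assumption: the duplicated groups differ in invisible characters.
SELECT color,
       length(color) AS len,        -- differs if there is hidden padding
       hex(color)    AS raw_bytes,  -- hexadecimal dump of the stored bytes
       COUNT(*)      AS n
FROM RECORD_CSV
GROUP BY color, length(color), hex(color)
ORDER BY color;
```

If two rows show the same color text but a different len or raw_bytes, the values are not identical in the source data, and the counts can be compared per variant against RECORD_CSV_BYCOLOR.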

Hive drops records

0 answers: