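I have a CSV file with two header rows and I want to remove them. How can I delete the first two lines of a CSV file in Hive or Pig? The first few lines of the file look like this: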
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM FL_NUM ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM FL_NUM ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 6 California 91 900 855 -5 0 0 -1 0900-0959 17 912 1230 7 1230 1237 7 7 0 0 1200-1259 0 0 390 402 378 1 2475 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 6 California 91 900 850 -10 0 0 -1 0900-0959 15 905 1202 9 1230 1211 -19 0 0 -2 1200-1259 0 0 390 381 357 1 2475 10
Answer 0 (score: 3)
Try this and adapt it to your needs. Here I load each line as a single chararray; you could also define a column for each field.
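-- Load each line of the file as a single chararray
a = LOAD 'file.csv' USING TextLoader() AS (line:chararray);
-- Drop every line that starts with 'YEAR', which covers both header lines
b = FILTER a BY SUBSTRING(line, 0, 4) != 'YEAR';
DUMP b;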
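If you would rather work with separate fields, the same filter can be written against a named column. This is only a sketch under a few assumptions: the file is tab-delimited (as in the Hive example), only the first three columns are named for brevity, and the relation names `flights` and `data` are made up for illustration. As far as I know Pig drops fields beyond the declared schema, so list all columns if you need them.

```pig
-- Assumption: tab-delimited input; only the first three columns are declared here
flights = LOAD 'file.csv' USING PigStorage('\t')
          AS (year:chararray, quarter:chararray, month:chararray);
-- Both header lines carry the literal text 'YEAR' in the first column
data = FILTER flights BY year != 'YEAR';
DUMP data;
```

The year is kept as a chararray so the header text can be compared directly; cast it to int in a later FOREACH if you need a number.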
Or, using Hive:
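-- Declare one column per field in the file (col1 and col2 are placeholders)
CREATE TABLE temp (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="2");
-- Use LOAD DATA LOCAL INPATH instead if the file is on the local filesystem
LOAD DATA INPATH 'file path' INTO TABLE temp;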
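With this property set, Hive skips the first two lines of the file whenever the table is queried, so only the remaining records are returned. The file itself is loaded unchanged.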