Pig Latin script treats the different columns of a CSV file as a single column

Time: 2017-03-20 05:00:01

Tags: apache-pig

I am pasting just one line from the file as an example. The following line is from the file "airlines_new.txt", which I am loading into a relation:
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA

=============================================== =====

I am using the following query:

Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' 
USING PigStorage(' ') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane ,ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin,Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);

=============================================== ===========

B = FOREACH Airlines_data_schema generate $0 ;

dump  B ;

=============================================== ==========

Result:

  

(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay)
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,0,NA,NA,NA,NA,NA)

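It treats all the columns as a single column, but the intent is to split them into separate columns. Ideally, according to my script, it should return only the "Year" column.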

2 answers:

Answer 0: (score: 0)

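The records are comma-delimited, but your script passes ' ' (a space) to PigStorage as the delimiter. Because the line contains no spaces, it is never split, and the whole record ends up in the first field ($0). Change the script to use ',' as the PigStorage delimiter.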

Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' USING PigStorage(',') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual:chararray,CRSDeptime:chararray,Arrtime_actual:chararray,CRSArrtime:chararray,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
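To see why the space delimiter produced a single column, here is a minimal Python sketch (not Pig itself) that models a delimiter-based loader as a plain string split. The helper name `split_record` is hypothetical, and the sample line is the record from the question with the stray padding removed.

```python
# Sample record from the question (padding removed); it contains no spaces.
LINE = ("2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,"
        "IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA")


def split_record(line, delimiter):
    """Hypothetical stand-in for a delimiter-based loader: split one record."""
    return line.split(delimiter)


# Splitting on a space finds nothing to split on, so the whole line
# lands in the first (and only) field.
by_space = split_record(LINE, " ")

# Splitting on a comma yields one value per column.
by_comma = split_record(LINE, ",")

print(len(by_space), "field(s) with ' ' as the delimiter")
print(len(by_comma), "field(s) with ',' as the delimiter; first =", by_comma[0])
```

With the space delimiter the first field holds the entire line, which matches what the dump of $0 showed in the question.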

Answer 1: (score: 0)

In this case you need to use the delimiter that actually appears in the data, so that the fields are split apart.

Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' 
USING PigStorage(',') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane, ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin, Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);

This ensures that you can access each field of the CSV individually, since the fields are separated by ','.
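As a rough illustration of what "access each field" means once the comma is the delimiter, the Python sketch below (again, not Pig) pairs the field names from the schema with the values of the sample record by position. The names `FIELD_NAMES` and `record` are made up for this example; the field list is copied from the AS clause above.

```python
# Field names copied from the AS (...) clause of the LOAD statement.
FIELD_NAMES = [
    "Year", "Month", "DayofMonth", "DayofWeek", "DepTime_actual",
    "CRSDeptime", "Arrtime_actual", "CRSArrtime", "UniqueCarrier",
    "FlightNum", "TailNum_Plane", "ActualElapsedTime", "CRSElapsedTime",
    "Airtime", "Arrdelay", "Depdelay", "Origin", "Dest", "Distance",
    "Taxiin", "Taxiout", "Cancelled", "CancellationCode", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
    "LateAircraftDelay",
]

# Sample record from the question (padding removed).
LINE = ("2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,"
        "IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA")

values = LINE.split(",")

# Position 0 corresponds to $0 (Year), position 1 to $1 (Month), and so on.
record = dict(zip(FIELD_NAMES, values))

print("Year:", record["Year"])
print("Origin -> Dest:", record["Origin"], "->", record["Dest"])
# An empty value between two commas is kept as an empty string here.
print("CancellationCode is empty:", record["CancellationCode"] == "")
```

The schema and the record both have the same number of entries, so every name lines up with exactly one value.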