I am pasting just one line from the file as an example. The following line is from the file "airlines_new.txt", which I am loading into a relation:
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA
=============================================== =====
I am using the following query:
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(' ') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane ,ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin,Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
=============================================== ===========
B = FOREACH Airlines_data_schema generate $0 ;
dump B ;
=============================================== ==========
Result:
(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay)
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,0,NA,NA,NA,NA,NA)
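It returns all the columns as a single column, but the intent is to split them into separate columns. According to my script, it should return only the "Year" column.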
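Answer 0 (score: 0)
The records are comma-separated, but your script uses ' ' as the delimiter, so each whole line ends up in the first field. Modify the script to use ',' as the PigStorage delimiter.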
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt' USING PigStorage(',') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual:chararray,CRSDeptime:chararray,Arrtime_actual:chararray,CRSArrtime:chararray,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
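To see why the delimiter matters, here is a minimal sketch in Python (not Pig) that mimics the field splitting. It assumes the loader splits each line on the literal delimiter string and does nothing else. The helper `split_fields` is hypothetical, and the sample record is the one from the question with the stray space removed.

```python
# Sample record from the question (stray space after one "NA" removed; an assumption).
line = "2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA"


def split_fields(record, delimiter):
    """Hypothetical stand-in for a delimiter-based loader: split on the literal delimiter."""
    return record.split(delimiter)


# Splitting on a space finds no delimiter, so the whole record is one field.
by_space = split_fields(line, " ")

# Splitting on a comma yields one field per column.
by_comma = split_fields(line, ",")

print(len(by_space))   # number of fields with ' ' as the delimiter
print(len(by_comma))   # number of fields with ',' as the delimiter
print(by_comma[0])     # the first field, i.e. the Year column
```

With ' ' the entire record lands in the first field, which is what `$0` returned in the question. With ',' the record splits into 29 fields, matching the 29 names in the schema, and the first field is just the year.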
Answer 1 (score: 0)
In this case you need to use the correct delimiter so that the fields are split apart.
Airlines_data_schema = LOAD '/user/Jig13517/airlines_new.txt'
USING PigStorage(',') AS
(Year, Month, DayofMonth, DayofWeek, DepTime_actual:chararray, CRSDeptime:chararray, Arrtime_actual:chararray, CRSArrtime:chararray, UniqueCarrier, FlightNum, TailNum_Plane, ActualElapsedTime, CRSElapsedTime, Airtime, Arrdelay, Depdelay, Origin, Dest, Distance, Taxiin, Taxiout, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay);
This ensures that you can access each field of the CSV individually, since the fields are separated by ','.