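I am trying to find the difference between two time fields in a Pig relation. I can use Pig's ToDate() function, but it expects the value in HHmm format, and my values have no leading zeros. For example, if the two fields hold 1245 and 1425, I can convert them with ToDate and compute the difference. But if the values are 945 and 823, I cannot convert them with ToDate because the leading zero is missing.
So I wrote a Python UDF to left-pad the values with zeros. Here is the code: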
@outputSchema("time:bytearray")
def zero(time):
    time = str(time)
    if len(time) <= 3:
        return '0' + time
    else:
        return time
Step 1: Register my Python function
REGISTER '/home/Jig13517/zeropad.py' using jython AS myfuncs ;
Here is the relation:
Airlines_data_schema = LOAD '/user/Jig13517/pigsample/Airlines_data.csv' USING PigStorage('\t') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual,CRSDeptime,Arrtime_actual,CRSArrtime,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
=====================================
Then I tried to left-pad the column values with zeros:
airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,myfuncs.zero($4) AS DepTime_actual_new,myfuncs.zero($5) AS CRSDeptime_new,myfuncs.zero($6) AS Arrtime_actual_new,myfuncs.zero($7) AS CRSArrtime_new,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay ;
===============================
Sample data after applying the Python UDF:
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,)
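But as the output above shows, the column values are not being converted; the time fields come out unchanged. Please let me know what is wrong with my UDF, or whether Pig has a built-in way to accomplish this.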
Answer 0: (score: 0)
The str.zfill function can help.
input.txt
1245
1425
945
823
pig_udfs.py
@outputSchema('time:chararray')
def lpad_time(time):
    return time.zfill(4)
time_formatter.pig
register pig_udfs.py using jython as myfuncs;
A = LOAD 'input.txt' USING PigStorage();
B = FOREACH A GENERATE myfuncs.lpad_time((chararray) $0);
\d B
Output
(1245)
(1425)
(0945)
(0823)
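To see why str.zfill is preferable to prepending a single '0' as the UDF in the question does, here is a small plain-Python comparison that runs without Pig. The helper names pad_one_zero and pad_zfill are made up for this illustration, and the inputs 45 and 5 are hypothetical values added to show the short-input case.

```python
def pad_one_zero(time):
    # Mirrors the question's UDF logic: prepend exactly one '0' to short values.
    time = str(time)
    return '0' + time if len(time) <= 3 else time


def pad_zfill(time):
    # str.zfill left-pads with zeros until the string is 4 characters wide.
    return str(time).zfill(4)


if __name__ == "__main__":
    for raw in (1245, 945, 45, 5):
        print("%s one-zero=%s zfill=%s" % (raw, pad_one_zero(raw), pad_zfill(raw)))
```

Both helpers agree on three- and four-digit values, but only zfill produces a full four-character string for times shortly after midnight such as 45 or 5.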
Obviously, you could also have Python do the whole ToDate conversion itself...
Also, it isn't clear from your question whether the minutes are zero-padded.
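As a rough sketch of what doing the whole conversion in Python might look like, the functions below turn an hhmm value into minutes since midnight and subtract. The names hhmm_to_minutes and minutes_between are hypothetical, and the sketch assumes both times fall on the same day and are purely numeric (it does not handle missing values such as NA or flights that cross midnight). To use it from Pig it would additionally need an @outputSchema decorator.

```python
def hhmm_to_minutes(value):
    """Convert an hhmm value such as 945 or '0945' to minutes since midnight."""
    padded = str(value).zfill(4)      # e.g. 945 -> '0945'
    hours = int(padded[:2])           # first two characters are the hour
    minutes = int(padded[2:])         # last two characters are the minutes
    return hours * 60 + minutes


def minutes_between(start, end):
    """Difference in minutes between two same-day hhmm values."""
    return hhmm_to_minutes(end) - hhmm_to_minutes(start)


if __name__ == "__main__":
    print(minutes_between(1245, 1425))  # 100
    print(minutes_between(823, 945))    # 82
```

Working in minutes avoids the date parsing step entirely, at the cost of having to deal with midnight rollover separately if it matters.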
Edit
airlines.csv
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,
Pig code
register pig_udfs.py using jython as myfuncs;
A = LOAD 'airlines.csv' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS Year, $1 AS Month, $2 AS DayofMonth, $3 AS DayofWeek,myfuncs.lpad_time((chararray) $4) AS DepTime_actual_new,myfuncs.lpad_time((chararray) $5) AS CRSDeptime_new,myfuncs.lpad_time((chararray) $6) AS Arrtime_actual_new,myfuncs.lpad_time((chararray) $7) AS CRSArrtime_new,$8 AS UniqueCarrier,$9 AS FlightNum,$10 AS TailNum_Plane,$11 AS ActualElapsedTime, $12 AS CRSElapsedTime, $13 AS Airtime, $14 AS Arrdelay, $15 AS Depdelay, $16 AS Origin, $17 AS Dest, $18 AS Distance, $19 AS Taxiin, $20 AS Taxiout, $21 AS Cancelled, $22 AS CancellationCode, $23 AS Diverted, $24 AS CarrierDelay, $25 AS WeatherDelay, $26 AS NASDelay, $27 AS SecurityDelay, $28 AS LateAircraftDelay ;
\d B
Output
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
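Answer 1: (score: 0)
Hey @cricket_007, I got it working. I was passing the column fields as bytearray, which was my mistake. Once I changed the schema to chararray, the zero-padding started working. Thanks a lot. Please find the corrected records below: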
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
(2008,1,3,4,0628,0620,0804,0750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA)
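Since this data set also contains empty fields and NA markers, a slightly more defensive padding function may be worth considering. The sketch below is only an illustration: lpad_time_safe is a hypothetical name, it assumes Pig nulls reach a Jython UDF as None, and it is shown without the @outputSchema decorator so that it runs as plain Python. It does not replace the chararray cast described above.

```python
def lpad_time_safe(time):
    # Assumption: a Pig null arrives as None; pass it through unchanged.
    if time is None:
        return None
    text = str(time).strip()
    # Leave empty strings and non-numeric markers such as 'NA' untouched.
    if not text.isdigit():
        return text
    # Numeric values are left-padded with zeros to four characters.
    return text.zfill(4)


if __name__ == "__main__":
    for raw in ("617", "1245", "NA", "", None):
        print(repr(raw), "->", repr(lpad_time_safe(raw)))
```

Registered as a Pig UDF, it would still need @outputSchema('time:chararray') and a chararray input, as in the answer above.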