I want to use a Python UDF

Time: 2017-03-03 13:25:20

Tags: python python-2.7 apache-pig

I followed the steps below. zeropad.py is my Python script:

#!/usr/bin/python

from org.apache.pig.scripting import *

@outputSchema('time:int')

def zero():
    time.zfill(4)

=======================================

grunt> REGISTER 'zeropad.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;

==============================

Airlines_data_schema = LOAD 'AirlinesData_sample-1.csv' USING PigStorage('\t') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual:int,CRSDeptime:int,Arrtime_actual:int,CRSArrtime:int,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);

===================================================

 airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,myfuncs.zero.DepTime_actual AS DepTime_actual_new,myfuncs.zero.CRSDeptime AS CRSDeptime_new,myfuncs.zero.Arrtime_actual AS Arrtime_actual_new,myfuncs.zero.CRSArrtime AS CRSArrtime_new,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay ;

I get the following error:

2017-02-26 19:37:19,606 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:

Invalid field projection. Projected field [myfuncs] does not exist in schema: Year:bytearray,Month:bytearray,DayofMonth:bytearray,DayofWeek:bytearray,DepTime_actual:int,CRSDeptime:int,Arrtime_actual:int,CRSArrtime:int,UniqueCarrier:bytearray,FlightNum:bytearray,TailNum_Plane:bytearray,ActualElapsedTime:bytearray,CRSElapsedTime:bytearray,Airtime:bytearray,Arrdelay:bytearray,Depdelay:bytearray,Origin:bytearray,Dest:bytearray,Distance:bytearray,Taxiin:bytearray,Taxiout:bytearray,Cancelled:bytearray,CancellationCode:bytearray,Diverted:bytearray,CarrierDelay:bytearray,WeatherDelay:bytearray,NASDelay:bytearray,SecurityDelay:bytearray,LateAircraftDelay:bytearray

I would like to know why I cannot use my Python function to manipulate my column values.

2 Answers:

Answer 0: (score: 0)

Try the following syntax, calling the UDF with parentheses around the column instead of a dot:

airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,myfuncs.zero(DepTime_actual) AS DepTime_actual_new,myfuncs.zero(CRSDeptime) AS CRSDeptime_new,myfuncs.zero(Arrtime_actual) AS Arrtime_actual_new,myfuncs.zero(CRSArrtime) AS CRSArrtime_new,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay;

Answer 1: (score: 0)

Got it working with the small corrections below:

#!/usr/bin/python

@outputSchema("num:int")

def zero(time):
    return time.zfill(4)


REGISTER '/home/Jig13517/zeropad.py' using jython AS func ;


airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,func.zero(Airlines_data_schema.DepTime_actual) AS DepTime_actual_new:int,func.zero(Airlines_data_schema.CRSDeptime) AS CRSDeptime_new:int,func.zero(Airlines_data_schema.Arrtime_actual) AS Arrtime_actual_new:int,func.zero(Airlines_data_schema.CRSArrtime) AS CRSArrtime_new:int,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay ;
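
One caveat on the UDF above: `zfill` is a string method, while the four time columns are loaded as `int`, and the declared output schema is also `int` even though leading zeros only survive in a string. Below is a minimal sketch of a variant that converts to a string first and passes nulls through. The `chararray` output schema and the `try`/`except` stub are my own assumptions rather than part of the fix above; the stub is there only so the file can also be run with a plain Python interpreter outside Pig, and it assumes Pig's Jython engine supplies `outputSchema` when it loads the script.

```python
#!/usr/bin/python

# Assumption: Pig's Jython engine provides outputSchema when it loads this file.
# The stub below is only a fallback so the file can be run outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("time:chararray")
def zero(time):
    # Pig passes null fields as None; keep them null.
    if time is None:
        return None
    # Convert to str first because int has no zfill method.
    return str(time).zfill(4)
```

With this variant, `zero(730)` gives the string `'0730'`. The `AS ..._new:int` aliases in the FOREACH would presumably need to become `chararray` (or drop the type) as well, since an `int` column cannot hold the leading zero.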