Improving the performance of a Pig script that uses a Python UDF

Time: 2016-05-09 09:25:22

Tags: python-2.7 apache-pig udf

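Below is a Pig (0.15) script that maps an input file (alias cdrs) against another file (alias mastergt). It calls a Python (2.7.11) UDF to do the mapping, and it takes 40 minutes for about 4.5K records. Can you suggest improvements?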

Pig Script:

REGISTER 'smsiuc_udf.py' using streaming_python as smsiuc_udfs;

cdrs = load '2016040111*' USING PigStorage('|','-tagFile') ;

mastergtrec = load 'master.txt' USING PigStorage(',','-tagFile');

mastergt = FOREACH mastergtrec GENERATE (chararray) UPPER($1) as opcdpc, (chararray) UPPER($2) as gtoptname,(chararray) UPPER($3) as gtoptcircle;

cdrrecord = FOREACH cdrs GENERATE (chararray) UPPER($1) as aparty, (chararray) UPPER($2) as bparty,$3 as smssentdate,$4 as smssenttime,($29=='6' ? 'S' : 'F') as status,(chararray) UPPER($26) as srcgt,(chararray) UPPER($27) as destgt,($12=='405899136999995' ? 'MTSDEL-CDMA' : ($12=='919875089998' ? 'MTSRAJ-GSM' : ($12=='405899150999995' ? 'MTSCHN-CDMA' : $12) ) ) as smscgt, (chararray)$0 as cdrfname,(chararray) $13 as prepost;

filteredp2pcdrs = FILTER cdrrecord by smsiuc_udfs.pullp2pcdrs(aparty,bparty,srcgt,destgt) and status == 'S' and SUBSTRING(smssentdate,4,6) == '$MON';

groupp2pcdrs = GROUP filteredp2pcdrs by (srcgt,destgt,aparty,bparty,smscgt,status,prepost);

distinctp2pcdrs= FOREACH groupp2pcdrs {
    uniq = DISTINCT filteredp2pcdrs.(srcgt,destgt,aparty,bparty,smscgt,status,prepost);
    GENERATE FLATTEN(group),COUNT(uniq) as cnt;
    };

p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty),smscgt,status,prepost,cnt;

The Python UDF is as follows:

    import os

    # MASTERGT, MASTERLRN and readfileinbag() are not shown in the question;
    # presumably they are defined elsewhere in smsiuc_udf.py.
    def p2preport(srcgt,destgt,aparty,bparty):
        mastergt = {}
        masterlrn = {}
        origno = str(int(aparty))
        destno = str(int(bparty))
        returnstring = []
        try:
            # Both master files must exist, be readable and be non-empty
            if ((os.path.isfile(MASTERLRN) and os.access(MASTERLRN, os.R_OK) and os.stat(MASTERLRN).st_size > 0) and (os.path.isfile(MASTERGT) and os.access(MASTERGT, os.R_OK) and os.stat(MASTERGT).st_size > 0)):

                #READ CONTENTS OF MASTER GT/LRN IN BAG/DICT
                # Note: these files are re-read on every call, i.e. once per record
                mastergt = readfileinbag(MASTERGT,1)
                masterlrn = readfileinbag(MASTERLRN,2)
                mastergtcircle = readfileinbag(MASTERGT,2)

                # Source GT: exact match first, then prefixes of 9 down to 4 characters
                if(srcgt in mastergt):
                    returnstring = mastergt[srcgt]
                elif(srcgt[0:9] in mastergt):
                    returnstring = mastergt[srcgt[0:9]]
                elif(srcgt[0:8] in mastergt):
                    returnstring = mastergt[srcgt[0:8]]
                elif(srcgt[0:7] in mastergt):
                    returnstring = mastergt[srcgt[0:7]]
                elif(srcgt[0:6] in mastergt):
                    returnstring = mastergt[srcgt[0:6]]
                elif(srcgt[0:5] in mastergt):
                    returnstring = mastergt[srcgt[0:5]]
                elif(srcgt[0:4] in mastergt):
                    returnstring = mastergt[srcgt[0:4]]
                else:
                    returnstring = mastergt.get(srcgt,srcgt+",")

                # Destination GT: same lookup order, appended to the source result
                if destgt in mastergt:
                    returnstring = returnstring + "," + mastergt[destgt]
                elif(destgt[0:9] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:9]]
                elif(destgt[0:8] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:8]]
                elif(destgt[0:7] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:7]]
                elif(destgt[0:6] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:6]]
                elif(destgt[0:5] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:5]]
                elif(destgt[0:4] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:4]]
                else:
                    returnstring = returnstring + mastergt.get(destgt,destgt+",")

            return returnstring

        except AttributeError:
            pass
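
One thing that stands out in the UDF above is that it calls readfileinbag() three times for every record, so the master files are re-read on each call. Below is a minimal sketch of one possible way to avoid that, assuming the streaming Python process stays alive across records so that a module-level cache is reused. The names `cached`, `longest_prefix`, `fake_loader` and the sample table contents are hypothetical and only for illustration; `fake_loader` stands in for readfileinbag(), whose real behavior is not shown in the question.

```python
# Hypothetical helpers, not part of the original smsiuc_udf.py.
_CACHE = {}

def cached(loader, *args):
    """Call loader(*args) only the first time; reuse the stored result afterwards."""
    key = (loader.__name__,) + args
    if key not in _CACHE:
        _CACHE[key] = loader(*args)
    return _CACHE[key]

def longest_prefix(table, gt, min_len=4, max_len=9):
    """Look up gt exactly, then by its prefixes of max_len down to min_len characters.

    Returns the first matching value, or None if nothing matches.
    This mirrors the order of the if/elif chain in p2preport.
    """
    if gt in table:
        return table[gt]
    for n in range(max_len, min_len - 1, -1):
        prefix = gt[:n]
        if prefix in table:
            return table[prefix]
    return None

# Demo with a stand-in loader (assumption: the real loader returns a dict).
calls = []

def fake_loader(path, col):
    calls.append(path)  # record each real load so we can count them
    return {"91987": "OPER-A", "9198750": "OPER-B"}

table = cached(fake_loader, "master.txt", 1)
table = cached(fake_loader, "master.txt", 1)  # second call hits the cache

print(len(calls))
print(longest_prefix(table, "919875089998"))
```

In the UDF, the idea would be to replace each readfileinbag(...) call with cached(readfileinbag, ...) and each if/elif chain with a single longest_prefix(...) call, handling the None case the same way the existing else branches do. Whether this accounts for most of the 40 minutes would still need to be measured.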

0 answers:

No answers