Passing DataFrame output to a UDF in pandas

Time: 2017-09-22 21:02:20

Tags: python pandas

I have a script with a few UDFs that mostly use list comprehensions to modify a dataframe:

def createclaimfields(field,master):
    # getdt, getclasshybrid, gethrv, getcc and getpass are UDFs defined elsewhere in the script
    print('creating unique matter ids for {} at {}'.format(field,getdt()))
    dfcol = ['MATTER ID',field]
    df = master[dfcol].dropna().drop_duplicates()
    print('created unique matter ids for {} at {}'.format(field,getdt()))

    # .count() counts only non-null values, so it reports matches rather than total rows
    print('started getting CLASS HYBRID claims for {} at {}'.format(field,getdt()))
    df['{} CLASS HYBRID CLM NO'.format(field)]=[getclasshybrid(clm) for clm in df[field]]
    print('finished getting CLASS HYBRID claims for {} at {}. Found {} matches'.format(field,getdt(),df['{} CLASS HYBRID CLM NO'.format(field)].count()))

    print('started getting HRV claims for {} at {}'.format(field,getdt()))
    df['{} HRV CLM NO'.format(field)]=[gethrv(clm) for clm in df[field]]
    print('finished getting HRV claims for {} at {}. Found {} matches'.format(field,getdt(),df['{} HRV CLM NO'.format(field)].count()))

    print('started getting CC claims for {} at {}'.format(field,getdt()))
    df['{} CC CLM NO'.format(field)]=[getcc(clm) for clm in df[field]]
    print('finished getting CC claims for {} at {}. Found {} matches'.format(field,getdt(),df['{} CC CLM NO'.format(field)].count()))

    print('started getting PASS claims for {} at {}'.format(field,getdt()))
    df['{} PASS CLM NO'.format(field)]=[getpass(clm) for clm in df[field]]
    print('finished getting PASS claims for {} at {}. Found {} matches'.format(field,getdt(),df['{} PASS CLM NO'.format(field)].count()))

    # merge returns a new DataFrame; the caller must use the returned value
    print('merging {} into claimfields at {}'.format(field,getdt()))
    master = master.merge(df,how='left',on=['MATTER ID',field])
    print('merged {} into claimfields at {}'.format(field,getdt()))

    return master

fieldlist = ['MATTER NUMBER','MATTER NAME','CLAIM NUMBER LISTING'] 
mattercol = ['MATTER NUMBER','MATTER NAME','CLAIM NUMBER LISTING','MATTER ID']
claimfields = rawtrans[mattercol].dropna().drop_duplicates().head()

[createclaimfields(field,claimfields) for field in fieldlist]

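Unfortunately, when I inspect claimfields after running this, I get the original output with no columns added. My guess is that this is because 'claimfields' refers to the call 'rawtrans[mattercol].dropna().drop_duplicates().head()' rather than the actual output of that call. How can I define claimfields as its own object rather than a chain of commands derived from the 'rawtrans' df?

Thanks!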

Edit: Problem solved! I replaced [createclaimfields(field,claimfields) for field in fieldlist] with the following:

for field in fieldlist:
    claimfields=createclaimfields(field,claimfields)

tl;dr: I wasn't assigning the output dataframe correctly, and I also didn't need a list comprehension to iterate over each field in fieldlist.
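
To illustrate why the assignment matters, here is a minimal sketch on made-up data. The `add_flag` helper and the `FLAG` column are hypothetical stand-ins for createclaimfields and its output columns, not part of the real script. It shows that `merge` returns a new DataFrame and that rebinding `master` inside the function does not change the caller's variable:

```python
import pandas as pd

def add_flag(master):
    # Hypothetical stand-in for createclaimfields: merge returns a NEW
    # DataFrame, and `master = ...` only rebinds the local name.
    extra = pd.DataFrame({'MATTER ID': [1, 2], 'FLAG': ['a', 'b']})
    master = master.merge(extra, how='left', on='MATTER ID')
    return master

claimfields = pd.DataFrame({'MATTER ID': [1, 2, 3]})

# Return value discarded: the caller's frame is untouched.
add_flag(claimfields)
unchanged_cols = list(claimfields.columns)

# Return value assigned back: the new column is kept.
claimfields = add_flag(claimfields)
updated_cols = list(claimfields.columns)

print(unchanged_cols)  # ['MATTER ID']
print(updated_cols)    # ['MATTER ID', 'FLAG']
```

The original `rawtrans[mattercol].dropna().drop_duplicates().head()` expression is evaluated once when claimfields is defined, so claimfields is already its own object. The columns went missing only because the returned frames were thrown away.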

Edit #2 - sample UDF

import numpy as np

def getcc(clm):
    # start positions for the sliding window over the claim string
    zlist=range(len(clm))

    # list of prefixes to look for
    prefixlist = ['AA','AB','AC','AD','AE','AF','GA','GB','GC','GD','GE','GF','ZZ']

    # all 8-character substrings that contain a known prefix in their first 6 characters
    # and have exactly 2 letters; `and` is used because `&` binds tighter than `==`
    clmstrs=[x for x in [clm[z:z+8] for z in zlist] if len(x)==8 and any(p in x[:-2] for p in prefixlist) and sum(c.isalpha() for c in x)==2]

    # return the first match, or NaN when nothing matches
    if clmstrs:
        return clmstrs[0]
    return np.nan
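
A note on operator precedence in conditions of this kind: in Python, `&` binds tighter than `==`, while `and` binds looser, so the two spellings can give different results. The sketch below uses a made-up 8-character string (not real claim data) and a shortened, hypothetical prefix check to show the difference:

```python
# Hypothetical 8-character candidate, invented for illustration.
x = 'AA123456'

length_ok = len(x) == 8                  # True
prefix_ok = x[:2] in ('AA', 'AB')        # True
n_alpha = sum(c.isalpha() for c in x)    # 2

# `&` binds tighter than `==`, so this parses as
# (length_ok & prefix_ok & n_alpha) == 2, i.e. (1 & 1 & 2) == 2
bitwise_result = length_ok & prefix_ok & n_alpha == 2

# `and` binds looser than `==`, so the comparison is evaluated first
logical_result = length_ok and prefix_ok and n_alpha == 2

print(bitwise_result)  # False
print(logical_result)  # True
```

With `&`, the left-hand side can only ever be 0 or 1, so a comparison against 2 never succeeds. Either use `and` or wrap every comparison in its own parentheses.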

0 answers:

No answers