Vectorizing a function that runs across dataframe columns

Time: 2019-05-21 20:27:15

Tags: python pandas

Some stylized data to start with:

import pandas as pd

testdf = pd.DataFrame(data = [(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'), (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
                      columns=['Rev', 'Pnum', 'Service'])
   Rev Pnum   Service
0    1   AA  ServiceA
1    2   BB  ServiceB
2    3   CC  ServiceA
3    4   DD  ServiceD

To assign values for the services, we have:

pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']

I am trying to write a more Pythonic function that takes the df and returns another df based on the following:

testdf['Charge'] = testdf['Rev'] if testdf['Pnum'] in pnumlist else 0 #doesn't work, throws truth value ambiguous error
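For context, the error comes from `in` needing a single True/False while a comparison against a Series yields one boolean per row. A minimal sketch, assuming the sample data above (the names `raised`, `mask`, and `charge` are only illustrative):

```python
import pandas as pd

testdf = pd.DataFrame(
    data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
          (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
    columns=['Rev', 'Pnum', 'Service'],
)
pnumlist = ['AA', 'CC']

# The list membership test compares each list item with the whole Series,
# then tries to reduce the resulting boolean Series to one truth value.
try:
    testdf['Pnum'] in pnumlist
    raised = False
except ValueError:
    raised = True

# isin() keeps the comparison element-wise and returns a boolean mask.
mask = testdf['Pnum'].isin(pnumlist)

# where() keeps Rev where the mask is True and substitutes 0 elsewhere.
charge = testdf['Rev'].where(mask, 0)
```

The mask can then drive any per-row choice without a Python-level `if`.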

The returned df should also have, for each row of testdf, columns that count each of the services, so it should look something like:

outputdf = pd.DataFrame(data = [(1, 1, 0, 0, 0), (0, 0, 1, 0, 0), (3, 1, 0, 0, 0), (0, 0, 0, 0, 1)],
                       columns = ['Charge', 'Acount', 'Bcount', 'Ccount', 'Dcount'])

At the moment, I have a rowhandler function that processes each row of testdf, and I call apply on the df, passing in rowhandler:

def rowhandler(testdfrow: pd.Series) -> pd.Series:  # apply(axis=1) passes each row as a Series
    testdfrow['Charge'] = testdfrow['Rev'] if testdfrow['Pnum'] in pnumlist else 0
    for service in servicelist:
        testdfrow['{}count'.format(service)] = 1 if service in testdfrow['Service'] else 0
    return testdfrow

newcolslist = ['Charge']
newcolsdict = {col: 0 for col in newcolslist}
testdf = testdf.assign(**newcolsdict) #pre-allocating memory speeds up program
testdf = testdf.apply(rowhandler, axis = 1)

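In the real case, the row handler function deals with several other columns, and the data is large. So I am looking for ways to speed this up, and I think it can be done by vectorizing the row handler function. Any suggestions are appreciated, thanks.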

2 answers:

Answer 0: (score: 1)

Is get_dummies with concat what you need?

s1=testdf[['Rev']].where(testdf.Pnum.isin(pnumlist),0)
s2=testdf['Service'].where(testdf['Service'].isin(servicelist)).str.get_dummies()
df=pd.concat([s1,s2.reindex(columns=servicelist,fill_value=0)],axis=1)  # axis must be a keyword in pandas 2.0+
df
Out[563]: 
   Rev  ServiceA  ServiceB  ServiceC  ServiceD
0    1         1         0         0         0
1    0         0         1         0         0
2    3         1         0         0         0
3    0         0         0         0         1
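To get the exact column names from the question's outputdf, the same approach can be followed by a rename. This is a sketch under the assumption that every service name starts with the prefix 'Service'; the names `charge`, `counts`, and `result` are illustrative:

```python
import pandas as pd

testdf = pd.DataFrame(
    data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
          (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
    columns=['Rev', 'Pnum', 'Service'],
)
pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']

# Rev where Pnum is listed, 0 otherwise, named as in outputdf.
charge = testdf['Rev'].where(testdf['Pnum'].isin(pnumlist), 0).rename('Charge')

# One indicator column per service; reindex adds services that never occur.
counts = (
    testdf['Service']
    .where(testdf['Service'].isin(servicelist))
    .str.get_dummies()
    .reindex(columns=servicelist, fill_value=0)
)

# 'ServiceA' -> 'Acount', and so on (assumes the 'Service' prefix).
counts.columns = [name.replace('Service', '') + 'count' for name in servicelist]

result = pd.concat([charge, counts], axis=1)
print(result)
```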

Answer 1: (score: 0)

You can edit the dataframe using only column-based operations. For example:

testdf["Charge"] = testdf["Rev"].where(testdf["Pnum"].isin(pnumlist), 0)

for service in servicelist:
    testdf["{}_count".format(service)] = testdf["Service"].str.contains(service).astype(int)

In a performance comparison, this appears to be a large improvement over the row-wise apply.

Edit: I updated the answer to a more performant version.
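One way to check the speed difference yourself is a small timeit sketch like the one below. The frame size (the four sample rows repeated 500 times) and the helper names `rowwise` and `vectorized` are assumptions for illustration, and timings will vary by machine and pandas version:

```python
import timeit

import pandas as pd

pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']

base = pd.DataFrame(
    data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
          (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
    columns=['Rev', 'Pnum', 'Service'],
)
# Hypothetical larger frame: 2,000 rows built from the sample data.
bigdf = pd.concat([base] * 500, ignore_index=True)


def rowwise(df):
    """Row-by-row version, in the spirit of the question's rowhandler."""
    def handler(row):
        row['Charge'] = row['Rev'] if row['Pnum'] in pnumlist else 0
        for service in servicelist:
            row['{}_count'.format(service)] = 1 if service in row['Service'] else 0
        return row
    return df.apply(handler, axis=1)


def vectorized(df):
    """Column-based version using where/isin and str.contains."""
    out = df.copy()
    out['Charge'] = out['Rev'].where(out['Pnum'].isin(pnumlist), 0)
    for service in servicelist:
        # regex=False treats the service name as a plain substring.
        out['{}_count'.format(service)] = (
            out['Service'].str.contains(service, regex=False).astype(int)
        )
    return out


t_row = timeit.timeit(lambda: rowwise(bigdf), number=1)
t_vec = timeit.timeit(lambda: vectorized(bigdf), number=1)
print('row-wise apply: {:.4f}s, vectorized: {:.4f}s'.format(t_row, t_vec))
```

Both helpers return the same columns in the same order, so their outputs can be compared directly before trusting the timings.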