Python熊猫-向数据框添加多列,其中行和列是相互依赖的

时间:2018-11-13 11:48:38

标签: python pandas performance dataframe

我已经设法使它与for循环一起使用,但是在我正在使用的大型数据集上,这非常慢,所以我试图找到一种方法来使用pandas,groupby,apply和lamda函数。

import pandas as pd
example_df = pd.DataFrame({"scen": [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2], 
                      "cusip": ['031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2', '031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2'], 
                      "wal": [50, 55, 60, 65, 40, 50, 60, 70, 40, 45, 50, 55, 30, 40, 50, 60],
                      "par_val": [900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000],
                      "prin_cf": [0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000],
                      "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0, 208.33, 0, 0, 0, 277.78, 0, 0, 0],
                      "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0]})

for x in range(1, len(example_df['scen'])):

if (example_df['cusip'][x] == example_df['cusip'][x-1]):

# If bond matures, don't report book value
    if(example_df['par_val'][x] == 0):
        example_df['book_val'][x] = 0
    else:
        example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


    example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12

example_df

棘手的部分是每一行的帐面价值取决于上一行的摊销价值,而每个摊销值取决于同一行中的帐面价值。从这里对类似问题的回答来看,我认为也许可以使用跟踪先前值的全局变量来做到这一点。

Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

类似的东西:

def calc_bv(prin_cf, par_val, wal):
global bvalue, amort 
bvalue = bvalue - amort - prin_cf
amort = (bvalue - par_val)/wal/12
return bvalue, amort

bvalue = example_df.loc[0, 'book_val']
amort = example_df.loc[0, 'amortization']
example_df[1:][['book_val','amortization']] = example_df2[1:].apply(lambda row: calc_bv(row['prev_prin_cf'],row['par_val'],row['wal']), axis=1, result_type="expand")
example_df

1 个答案:

答案 0 :(得分:0)

毫无疑问,将有一个基于groupby的智能熊猫解决方案。但是只需使用numba重写循环,您就可以获得大约1000倍的不错的性能改善。

# Python 3.6.0, Pandas 0.19.2

assert jpp(df).equals(original(df))

%timeit jpp(df)       # 929 µs per loop
%timeit original(df)  # 1.05 s per loop

基准代码

原文:

def original(example_df):
    for x in range(1, len(example_df['scen'])):

        if (example_df['cusip'][x] == example_df['cusip'][x-1]):

        # If bond matures, don't report book value
            if(example_df['par_val'][x] == 0):
                example_df['book_val'][x] = 0
            else:
                example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


            example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12
    return example_df

Numba:

from numba import njit

@njit
def calculator(cusip, par, book, amort, prin_cf, wal):
    n = len(par)
    for i in range(1, n):
        if cusip[i] == cusip[i-1]:
            if par[i] == 0:
                book[i] == 0
            else:
                book[i] = book[i-1] - amort[i-1] - prin_cf[i-1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort


def jpp(df):
    df['book_val'], df['amortization'] = calculator(pd.factorize(df['cusip'])[0], df['par_val'].values,
                                                    df['book_val'].values, df['amortization'].values,
                                                    df['prin_cf'].values, df['wal'].values)

    return df