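Auto-filling a calculated column in Python

Date: 2015-06-06 00:52:08

Tags: python pandas

What I have so far is the code below. It runs fine and produces the expected result: wherever df['c'] is not given, it fills c with the previous c * b. The problem is that I have to apply this to a much larger dataset (len(df.index) is about 10,000), so the approach I have used so far is not suitable, because I would have to write df['c'] = df.apply(func, axis=1) a few thousand times. A while loop is not an option in pandas for a dataset of this size. Any ideas?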

import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=10, freq='D')

df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df["c"] = np.nan

# Seed the known values of c (label-based assignment avoids chained indexing)
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3


def func(x):
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']

df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)

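3 Answers:

Answer 0 (score: 4)

This is a good way to solve a recurrence problem. Documentation about this will be included in v0.16.2 (to be released next week). See the numba documentation.

This will be very performant, because the really heavy lifting is done in fast, JIT-compiled code.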

import pandas as pd
import numpy as np
from numba import jit

rng = pd.date_range('1/1/2011', periods=10, freq='D')
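df = pd.DataFrame({'a': [np.nan] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
# .ix was removed in pandas 1.0; create the column, then assign by label
df["c"] = np.nan
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3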

@jit
def ffill(arr_b, arr_c):

    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()

    for i in range(1,n):
        if not np.isnan(arr_c[i]):
            result[i] = arr_c[i]
        else:
            result[i] = result[i-1]*arr_b[i]

    return result

df['d'] = ffill(df.b.values, df.c.values)

             a   b   c      d
2011-01-01 NaN   2   1      1
2011-01-02 NaN   3 NaN      3
2011-01-03 NaN  10   3      3
2011-01-04 NaN   3 NaN      9
2011-01-05 NaN   5 NaN     45
2011-01-06 NaN   8 NaN    360
2011-01-07 NaN   4 NaN   1440
2011-01-08 NaN   1 NaN   1440
2011-01-09 NaN   2 NaN   2880
2011-01-10 NaN   6 NaN  17280
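
If numba is not installed, the same recurrence can be run as an ordinary Python loop over NumPy arrays. It will be slower than the jitted version, but it still avoids calling apply once per pass. The following is a minimal sketch only; the name `fill_recurrence` is made up for this illustration and is not part of pandas or numba.

```python
import numpy as np
import pandas as pd


def fill_recurrence(arr_b, arr_c):
    """Fill NaNs in arr_c with (previous result) * arr_b[i], using a plain loop."""
    b = np.asarray(arr_b, dtype=float)
    result = np.asarray(arr_c, dtype=float).copy()
    for i in range(1, len(result)):
        if np.isnan(result[i]):
            result[i] = result[i - 1] * b[i]
    return result


rng = pd.date_range("2011-01-01", periods=10, freq="D")
df = pd.DataFrame({"b": [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df["c"] = np.nan
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3

df["d"] = fill_recurrence(df["b"].to_numpy(), df["c"].to_numpy())
print(df)
```

The loop works on raw arrays rather than on the DataFrame, which is also what makes the function easy to hand to `@jit` later.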

Answer 1 (score: 4)

If you print out df inside a for loop:

for i in range(7):
    df['c'] = df.apply(func, axis =1)
    print(df)

you can trace where the values in column c come from:

               a   b      c
2011-01-01  None   2      1    1
2011-01-02  None   3      3    3*1
2011-01-03  None  10      3    1*3*1
2011-01-04  None   3      9    3*1*3*1
2011-01-05  None   5     45    5*3*1*3*1
2011-01-06  None   8    360    ...
2011-01-07  None   4   1440    ...
2011-01-08  None   1   1440    ...
2011-01-09  None   2   2880    ...
2011-01-10  None   6  17280    6*2*4*8*5*3*3

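You can clearly see that the values come from a cumulative product. Each row is the previous row's value multiplied by some new number. That new number sometimes comes from b, and is sometimes 1 (when c is not NaN).

So if we can create a column d that holds these "new" numbers, the desired values can be computed with cumprod: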

df['c'] = df['d'].cumprod()

import pandas as pd
import numpy as np

def setup_df():
    rng = pd.date_range('1/1/2011', periods=10, freq='D')
    df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                      index=rng)
    df["c"] = np.nan
    df.iloc[0, -1] = 1
    df.iloc[2, -1] = 3
    return df

df = setup_df()
df['d'] = df['b']
mask = pd.notnull(df['c'])
df.loc[mask, 'd'] = 1
df['c'] = df['d'].cumprod()
print(df)

yields

               a   b      c  d
2011-01-01  None   2      1  1
2011-01-02  None   3      3  3
2011-01-03  None  10      3  1
2011-01-04  None   3      9  3
2011-01-05  None   5     45  5
2011-01-06  None   8    360  8
2011-01-07  None   4   1440  4
2011-01-08  None   1   1440  1
2011-01-09  None   2   2880  2
2011-01-10  None   6  17280  6

I left the d column in place to help show where the c values come from. You can of course remove it with

del df['d']

Or better yet, as chrisaycock points out, you can skip defining the d column entirely and use

df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()
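
One caveat: substituting 1 for the rows where c is known works in this example because the given value on 2011-01-03 (3) happens to equal the running product at that point (1*3). If a known c should instead restart the recurrence from its own value, a grouped cumulative product is one option. The following is a sketch under that assumption, and it also assumes the first row of c is known; the function name `fill_with_restarts` and the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_restarts(b, c):
    """Fill NaNs in c via c[i] = c[i-1] * b[i], restarting at every known c."""
    known = c.notna()
    # Each known value opens a new segment; the NaN rows after it share its id.
    segment = known.cumsum()
    # The multiplier is 1 on the known row itself and b on the rows being filled.
    factor = b.where(~known, 1.0).groupby(segment).cumprod()
    # Carry each segment's starting value forward and scale it.
    return c.ffill() * factor


b = pd.Series([2, 3, 10, 3, 5], dtype=float)
c = pd.Series([1, np.nan, 7, np.nan, np.nan])
filled = fill_with_restarts(b, c)
print(filled.tolist())
```

Here the third row is given as 7 rather than the running product 3, so the rows after it continue from 7 (7*3, then 21*5) instead of from the global cumulative product.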

Answer 2 (score: 1)

You can write an explicit loop like this:

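# Use positional .iloc access; chained df.c[i] = ... may not write back to df
c_pos = df.columns.get_loc('c')
for i in range(1, len(df)):
    if pd.isnull(df.iloc[i, c_pos]):
        df.iloc[i, c_pos] = df.iloc[i - 1, c_pos] * df['b'].iloc[i]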

If this takes too long for you, you can use numba's jit. Your example DataFrame is too small for a meaningful timing test on my system.
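
To get a rough feel for the cost at the size mentioned in the question, one could time the explicit loop against the cumprod approach on synthetic data. This is only a sketch: the 10,000-row frame, the random multipliers close to 1 (chosen so the running product stays within float range), and the single known starting value are all assumptions made for the illustration, and timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

n = 10_000
rng = pd.date_range("2011-01-01", periods=n, freq="D")
gen = np.random.default_rng(0)
# Multipliers close to 1 keep the running product within float range.
df = pd.DataFrame({"b": gen.uniform(0.99, 1.01, size=n)}, index=rng)
df["c"] = np.nan
df.iloc[0, df.columns.get_loc("c")] = 1.0

# Explicit positional loop over the DataFrame.
looped = df.copy()
c_pos = looped.columns.get_loc("c")
b_pos = looped.columns.get_loc("b")
start = time.perf_counter()
for i in range(1, n):
    if pd.isnull(looped.iloc[i, c_pos]):
        looped.iloc[i, c_pos] = looped.iloc[i - 1, c_pos] * looped.iloc[i, b_pos]
loop_seconds = time.perf_counter() - start

# Vectorized alternative: multiply by 1 where c is known, by b elsewhere.
start = time.perf_counter()
vectorized = np.where(df["c"].notna(), 1.0, df["b"]).cumprod()
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s, cumprod: {vec_seconds:.5f}s")
print(np.allclose(looped["c"].to_numpy(), vectorized))
```

Because only the first row of c is known here, both approaches compute the same values, so the comparison isolates the cost of looping over the DataFrame row by row.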