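So far I have the code below, which runs fine and produces the expected result: where df['c'] is not given, c is filled with the previous c * b. The problem is that I have to apply this to a much larger dataset (len(df.index) is about 10,000), so the function I have so far is unsuitable, because I would have to write df['c'] = df.apply(func, axis=1) several thousand times. A while loop is not an option in pandas for a dataset of this size. Any ideas?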
import pandas as pd
import numpy as np
import datetime
randn = np.random.randn
rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},index=rng)
df["c"] = np.nan
# Assign by label with .loc instead of chained indexing (df["c"][0] = ...)
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3
def func(x):
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
df['c'] = df.apply(func, axis =1)
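Answer 0 (score: 4)
This is a good way to solve a recurrence problem like this one. There will be documentation on this in v0.16.2 (to be released next week). See the numba documentation. This will be quite performant, because the real heavy lifting is done in fast, jit-compiled code.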
import pandas as pd
import numpy as np
from numba import jit
rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': np.nan * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},index=rng)
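# .ix was removed in pandas 1.0; create the column, then assign by label with .loc
df["c"] = np.nan
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3
@jit
def ffill(arr_b, arr_c):
    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()
    for i in range(1, n):
        if not np.isnan(arr_c[i]):
            # A given c value is kept as-is
            result[i] = arr_c[i]
        else:
            # A missing c value is the previous result times the current b
            result[i] = result[i-1] * arr_b[i]
    return result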
df['d'] = ffill(df.b.values, df.c.values)
              a   b    c      d
2011-01-01  NaN   2    1      1
2011-01-02  NaN   3  NaN      3
2011-01-03  NaN  10    3      3
2011-01-04  NaN   3  NaN      9
2011-01-05  NaN   5  NaN     45
2011-01-06  NaN   8  NaN    360
2011-01-07  NaN   4  NaN   1440
2011-01-08  NaN   1  NaN   1440
2011-01-09  NaN   2  NaN   2880
2011-01-10  NaN   6  NaN  17280
Answer 1 (score: 4)
If you print out the value of df inside a for loop:
for i in range(7):
    df['c'] = df.apply(func, axis =1)
    print(df)
you can trace where the values in the c column come from:
               a   b      c
2011-01-01  None   2      1  1
2011-01-02  None   3      3  3*1
2011-01-03  None  10      3  1*3*1
2011-01-04  None   3      9  3*1*3*1
2011-01-05  None   5     45  5*3*1*3*1
2011-01-06  None   8    360  ...
2011-01-07  None   4   1440  ...
2011-01-08  None   1   1440  ...
2011-01-09  None   2   2880  ...
2011-01-10  None   6  17280  6*2*4*8*5*3*3
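You can clearly see that these values come from a cumulative product.
Each row is the previous row's value multiplied by some new number.
That new number sometimes comes from b, and sometimes it is 1 (when c is not NaN).
So if we can create a column d containing these "new" numbers, the desired values can be computed with cumprod: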
df['c'] = df['d'].cumprod()
import pandas as pd
import numpy as np
import datetime
randn = np.random.randn
def setup_df():
    rng = pd.date_range('1/1/2011', periods=10, freq='D')
    df = pd.DataFrame({'a': [None] * 10, 'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]},
                      index=rng)
    df["c"] = np.nan
    df.iloc[0, -1] = 1
    df.iloc[2, -1] = 3
    return df
df = setup_df()
df['d'] = df['b']
mask = pd.notnull(df['c'])
df.loc[mask, 'd'] = 1
df['c'] = df['d'].cumprod()
print(df)
yields
               a   b      c  d
2011-01-01  None   2      1  1
2011-01-02  None   3      3  3
2011-01-03  None  10      3  1
2011-01-04  None   3      9  3
2011-01-05  None   5     45  5
2011-01-06  None   8    360  8
2011-01-07  None   4   1440  4
2011-01-08  None   1   1440  1
2011-01-09  None   2   2880  2
2011-01-10  None   6  17280  6
I left the d column in to help show where the c values come from.
You can of course delete it with
del df['d']
or, better yet, as chrisaycock points out, you can skip defining the d column entirely by using
df['c'] = np.where(pd.notnull(df['c']), 1, df['b']).cumprod()
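One caveat worth noting: the cumprod approach uses a multiplier of 1 wherever c is given, so a given value is effectively replaced by the running product. That matches the example only because the value given on 2011-01-03 (3) happens to equal the running product at that row. The following is a minimal sketch of one way to handle arbitrary given values, assuming the first row always has a known c. The helper name fill_c and the demo data are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_c(df):
    """Hypothetical helper: fill missing c as previous c * b,
    restarting the product at every row where c is already known."""
    known = df["c"].notna()
    # Every known c value starts a new segment (1, 1, 2, 2, 2, ...)
    segment = known.cumsum()
    # Multiplier per row: the known c itself at a segment start, b elsewhere
    factor = df["c"].where(known, df["b"])
    # Cumulative product computed separately within each segment
    return factor.groupby(segment).cumprod()

rng = pd.date_range("2011-01-01", periods=6, freq="D")
demo = pd.DataFrame({"b": [2, 3, 10, 3, 5, 8]}, index=rng)
# Assumption for illustration: the second known value (5) differs from
# the running product (3), which a single global cumprod would not preserve
demo["c"] = [1, np.nan, 5, np.nan, np.nan, np.nan]
demo["c"] = fill_c(demo)
print(demo)
```

If the first row has no known c, this sketch starts the product from b instead of propagating NaN, so it should be adapted for that case.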
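Answer 2 (score: 1)
You could write a loop like this:
c = df.columns.get_loc('c')
for i in range(1, len(df)):
    # Positional .iloc access avoids chained assignment on df.c[i]
    if pd.isnull(df.iloc[i, c]):
        df.iloc[i, c] = df.iloc[i-1, c] * df['b'].iloc[i]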
If this takes too long for you, you can use numba jit. Your example DataFrame is too small for me to run a meaningful test on my system.
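The loop above can also be run over plain NumPy arrays, writing the result back to the DataFrame once at the end. This avoids per-element DataFrame indexing, and arrays are the form that a numba-jitted function would operate on. A minimal sketch, in which the function name fill_c_loop is hypothetical and no timing claim is made:

```python
import numpy as np
import pandas as pd

def fill_c_loop(df):
    """Hypothetical helper: run the recurrence c[i] = c[i-1] * b[i]
    over NumPy arrays wherever c[i] is missing."""
    b = df["b"].to_numpy(dtype=float)
    c = df["c"].to_numpy(dtype=float).copy()  # copy so df is not modified in place
    for i in range(1, len(c)):
        if np.isnan(c[i]):
            c[i] = c[i - 1] * b[i]
    return pd.Series(c, index=df.index, name="c")

rng = pd.date_range("2011-01-01", periods=10, freq="D")
df = pd.DataFrame({"b": [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df["c"] = np.nan
df.loc[rng[0], "c"] = 1
df.loc[rng[2], "c"] = 3
df["c"] = fill_c_loop(df)
print(df)
```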