I have a dataset in the following format:
Country Code Year Value
0 ABC 32 2000 NaN
1 ABC 32 2001 NaN
2 ABC 32 2002 NaN
3 ABC 32 2003 NaN
4 ABC 32 2004 1000000.0
5 ABC 32 2005 NaN
6 ABC 32 2006 NaN
7 ABC 32 2007 NaN
8 ABC 32 2008 NaN
9 ABC 32 2009 NaN
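I am trying to replace the NaN values so that they reflect an annual growth rate of r% around the non-NaN value; in other words, for the sample data, Value[i] should equal 1000000 * (1 + r)^x, where x is the index i minus the index of the non-NaN value (so x is negative for earlier rows).
For this small set, the following code does the job: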
df['imputed'] = ''
gr = 0.05  # growth rate
for i in range(len(df)):
    nx = df.Value.first_valid_index()  # index of first non-NaN value
    nv = df.Value[nx]  # first non-NaN value
    # .loc avoids chained assignment (df['imputed'][i] = ...)
    df.loc[i, 'imputed'] = nv * (1 + gr) ** (i - nx)
df
Country Code Year Value imputed
0 ABC 32 2000 NaN 822702
1 ABC 32 2001 NaN 863838
2 ABC 32 2002 NaN 907029
3 ABC 32 2003 NaN 952381
4 ABC 32 2004 1000000.0 1e+06
5 ABC 32 2005 NaN 1.05e+06
6 ABC 32 2006 NaN 1.1025e+06
7 ABC 32 2007 NaN 1.15763e+06
8 ABC 32 2008 NaN 1.21551e+06
9 ABC 32 2009 NaN 1.27628e+06
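However, the actual dataset has multiple combinations of "Country" and "Code" that need a similar calculation (note: as above, each of these combinations has exactly one non-NaN value).
If I build a new df (df2) containing all the Country/Code combinations that need it, how can I apply the above calculation to each matching combination in the main df? Note that there are also many combinations that do not need this calculation.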
df2
Country Code
0 ABC 32
1 DEF 27
2 GHI 19
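Answer 0 (score: 0)
You can process only a filtered dataframe, i.e. the rows of the full data that belong to the country (or other key) in question, and then append or merge everything back together. I am only outlining the approach here. Feel free to use the code below and tailor it into a more optimized solution.
Code: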
import numpy as np
import pandas as pd

gr = 0.05  # growth rate

# Sample data: three Country/Code groups with ten years each
# (note that df2 here is the full data, not the list of combinations)
df2 = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'GHI'], 10),
    'Code': np.repeat([32, 27, 19], 10),
    'Year': np.arange(2000, 2010).tolist() * 3,
    'Value': np.nan,
})
df2.loc[[4, 14, 24], 'Value'] = [1000000.0, 2000000.0, 3000000.0]
# print(df2)
# Make sure Value is numeric (matters if it was read in as strings such as 'NaN')
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2['imputed'] = np.nan

def process(df):
    # Expects a 0..n-1 index, so that row positions and index labels coincide
    nx = df.Value.first_valid_index()  # index of first non-NaN value
    nv = df.Value.loc[nx]  # first non-NaN value
    for i in range(len(df)):
        df.loc[i, 'imputed'] = nv * ((1 + gr) ** (i - nx))
    return df

parts = []
for c in df2.Country.unique():
    cond = (df2.Country == c)
    p_df = df2[cond].copy()
    p_df.reset_index(drop=True, inplace=True)
    parts.append(process(p_df))
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
new_df = pd.concat(parts, ignore_index=True)
print(new_df)
Output:
Country Code Year Value imputed
0 ABC 32 2000 NaN 8.227025e+05
1 ABC 32 2001 NaN 8.638376e+05
2 ABC 32 2002 NaN 9.070295e+05
3 ABC 32 2003 NaN 9.523810e+05
4 ABC 32 2004 1000000.0 1.000000e+06
5 ABC 32 2005 NaN 1.050000e+06
6 ABC 32 2006 NaN 1.102500e+06
7 ABC 32 2007 NaN 1.157625e+06
8 ABC 32 2008 NaN 1.215506e+06
9 ABC 32 2009 NaN 1.276282e+06
10 DEF 27 2000 NaN 1.645405e+06
11 DEF 27 2001 NaN 1.727675e+06
12 DEF 27 2002 NaN 1.814059e+06
13 DEF 27 2003 NaN 1.904762e+06
14 DEF 27 2004 2000000.0 2.000000e+06
15 DEF 27 2005 NaN 2.100000e+06
16 DEF 27 2006 NaN 2.205000e+06
17 DEF 27 2007 NaN 2.315250e+06
18 DEF 27 2008 NaN 2.431013e+06
19 DEF 27 2009 NaN 2.552563e+06
20 GHI 19 2000 NaN 2.468107e+06
21 GHI 19 2001 NaN 2.591513e+06
22 GHI 19 2002 NaN 2.721088e+06
23 GHI 19 2003 NaN 2.857143e+06
24 GHI 19 2004 3000000.0 3.000000e+06
25 GHI 19 2005 NaN 3.150000e+06
26 GHI 19 2006 NaN 3.307500e+06
27 GHI 19 2007 NaN 3.472875e+06
28 GHI 19 2008 NaN 3.646519e+06
29 GHI 19 2009 NaN 3.828845e+06
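The loop above handles one country at a time and does not yet restrict the calculation to the combinations listed in the question's df2. A possible vectorized variant is sketched below. It is only an illustration under a few assumptions: the helper name impute_growth, the extra group XYZ/99 and its value are made up for the example, each group is assumed to have at most one non-NaN Value, and the exponent is taken from the difference in Year rather than the difference in row index (the two coincide when years are consecutive, as in the sample data).

```python
import numpy as np
import pandas as pd

def impute_growth(df, combos, gr=0.05):
    """Hypothetical helper: fill 'imputed' only for the (Country, Code)
    pairs listed in combos; every other row gets NaN."""
    out = df.copy()
    # Flag rows whose (Country, Code) pair appears in combos.
    keys = pd.MultiIndex.from_frame(out[['Country', 'Code']])
    wanted = keys.isin(pd.MultiIndex.from_frame(combos[['Country', 'Code']]))
    # Broadcast each group's known value to all rows of the group
    # ('first' skips NaN, so it picks the single non-NaN entry).
    base_value = out.groupby(['Country', 'Code'])['Value'].transform('first')
    # Same idea for the year in which that value was observed.
    known_year = out['Year'].where(out['Value'].notna())
    base_year = known_year.groupby([out['Country'], out['Code']]).transform('first')
    # Assumption: exponent is the year offset from the known value.
    imputed = base_value * (1 + gr) ** (out['Year'] - base_year)
    out['imputed'] = imputed.where(wanted)
    return out

# Main data: ABC/32 needs imputing, XYZ/99 (made up) does not.
df = pd.DataFrame({
    'Country': np.repeat(['ABC', 'XYZ'], 10),
    'Code': np.repeat([32, 99], 10),
    'Year': list(range(2000, 2010)) * 2,
    'Value': np.nan,
})
df.loc[[4, 13], 'Value'] = [1000000.0, 500.0]

# Combinations that need the calculation, as in the question.
df2 = pd.DataFrame({'Country': ['ABC', 'DEF', 'GHI'], 'Code': [32, 27, 19]})

result = impute_growth(df, df2)
print(result)
```

Combinations that appear in df2 but not in the main data (DEF and GHI here) are simply ignored, and groups without any non-NaN Value end up with NaN in imputed. The Code column must have the same dtype in both frames for the pairs to match.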