How can I impute missing values using an annual growth rate in Python?

Time: 2019-08-07 08:14:37

Tags: python pandas imputation

I have a dataset in the following format:

    Country Code    Year    Value
0   ABC     32      2000    NaN
1   ABC     32      2001    NaN
2   ABC     32      2002    NaN
3   ABC     32      2003    NaN
4   ABC     32      2004    1000000.0
5   ABC     32      2005    NaN
6   ABC     32      2006    NaN
7   ABC     32      2007    NaN
8   ABC     32      2008    NaN
9   ABC     32      2009    NaN

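I am trying to replace the NaN values so that they follow an annual growth rate of r% around the non-NaN value. In other words, for the sample data, Value[i] should equal 1000000 * (1 + r)^x, where x is the index of row i minus the index of the non-NaN value (so x is negative for earlier rows).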
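As a quick check of the formula, here is a minimal sketch that evaluates it for the sample data. It assumes r = 0.05 and that the non-NaN value sits at index 4; the names `r`, `base`, and `anchor` are illustrative only and are not part of the dataset.

```python
import numpy as np

r = 0.05          # assumed annual growth rate
base = 1000000.0  # the single non-NaN value in the sample
anchor = 4        # row index of that non-NaN value

# x is negative before the known value and positive after it
x = np.arange(10) - anchor
values = base * (1 + r) ** x
print(values)
```

Rows before index 4 are discounted by powers of (1 + r) and rows after it are compounded, so the known value itself is reproduced unchanged at x = 0.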

For this small set, the following code does the job:

import numpy as np

df['imputed'] = np.nan # float column, so the results stay numeric
gr = 0.05 # growth rate

nx = df.Value.first_valid_index() # index of first non-NaN value
nv = df.Value[nx] # first non-NaN value
for i in range(len(df)):
    # .loc avoids chained assignment, which may write to a temporary copy
    df.loc[i, 'imputed'] = nv * (1+gr) ** (i - nx)
df


    Country   Code      Year    Value       imputed
0   ABC       32        2000    NaN         822702
1   ABC       32        2001    NaN         863838
2   ABC       32        2002    NaN         907029
3   ABC       32        2003    NaN         952381
4   ABC       32        2004    1000000.0   1e+06
5   ABC       32        2005    NaN         1.05e+06
6   ABC       32        2006    NaN         1.1025e+06
7   ABC       32        2007    NaN         1.15763e+06
8   ABC       32        2008    NaN         1.21551e+06
9   ABC       32        2009    NaN         1.27628e+06

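However, the actual dataset contains multiple combinations of "Country" and "Code" that need a similar calculation (note: as above, each of these combinations has exactly one non-NaN value).

If I build a new DataFrame (df2) containing all the Country/Code combinations that need it, how can I apply the calculation above to each matching combination in the main df? Note that there are also many combinations that do not need this calculation.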

df2
    Country Code
0   ABC     32
1   DEF     27
2   GHI     19

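1 Answer:

Answer 0 (score: 0)

You can filter the full data down to one country (or whatever key you need) at a time, process each filtered DataFrame, and then append or merge the pieces back together. I am only outlining the approach here. Feel free to take the code below and tailor it into a more optimized solution.

Code: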

import numpy as np
import pandas as pd

gr = 0.05 # growth rate, as in the question

# Sample data: three Country/Code pairs, ten years each
df2 = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'GHI'], 10),
    'Code': np.repeat([32, 27, 19], 10),
    'Year': np.arange(2000, 2010).tolist() * 3,
})
df2['Value'] = np.nan
df2.loc[[4,14,24],'Value'] = [1000000.0, 2000000.0, 3000000.0]

# print(df2)
# If Value was read in as strings such as 'NaN', convert it to floats first
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')

df2['imputed'] = np.nan
def process(df):
    # Expects a 0-based index so that i - nx is the offset in rows (years)
    nx = df.Value.first_valid_index() # index of first non-NaN value
    nv = df.Value.loc[nx] # first non-NaN value
    for i in range(len(df)):
        # print(nv,gr,i,nx)
        df.loc[i,'imputed'] = nv * ((1+gr) ** (i - nx))
    return df


parts = []
for c in df2.Country.unique():
    cond = (df2.Country == c)
    p_df = df2[cond].copy()
    p_df.reset_index(drop=True,inplace=True)
    parts.append(process(p_df))
# DataFrame.append was removed in pandas 2.0; concatenate the pieces instead
new_df = pd.concat(parts, ignore_index=True)

print(new_df)

Output:

   Country Code  Year      Value       imputed
0      ABC   32  2000        NaN  8.227025e+05
1      ABC   32  2001        NaN  8.638376e+05
2      ABC   32  2002        NaN  9.070295e+05
3      ABC   32  2003        NaN  9.523810e+05
4      ABC   32  2004  1000000.0  1.000000e+06
5      ABC   32  2005        NaN  1.050000e+06
6      ABC   32  2006        NaN  1.102500e+06
7      ABC   32  2007        NaN  1.157625e+06
8      ABC   32  2008        NaN  1.215506e+06
9      ABC   32  2009        NaN  1.276282e+06
10     DEF   27  2000        NaN  1.645405e+06
11     DEF   27  2001        NaN  1.727675e+06
12     DEF   27  2002        NaN  1.814059e+06
13     DEF   27  2003        NaN  1.904762e+06
14     DEF   27  2004  2000000.0  2.000000e+06
15     DEF   27  2005        NaN  2.100000e+06
16     DEF   27  2006        NaN  2.205000e+06
17     DEF   27  2007        NaN  2.315250e+06
18     DEF   27  2008        NaN  2.431013e+06
19     DEF   27  2009        NaN  2.552563e+06
20     GHI   19  2000        NaN  2.468107e+06
21     GHI   19  2001        NaN  2.591513e+06
22     GHI   19  2002        NaN  2.721088e+06
23     GHI   19  2003        NaN  2.857143e+06
24     GHI   19  2004  3000000.0  3.000000e+06
25     GHI   19  2005        NaN  3.150000e+06
26     GHI   19  2006        NaN  3.307500e+06
27     GHI   19  2007        NaN  3.472875e+06
28     GHI   19  2008        NaN  3.646519e+06
29     GHI   19  2009        NaN  3.828845e+06
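The loop above filters on Country only and processes every group. As a possible alternative, here is a minimal sketch that groups on both Country and Code, restricts the calculation to the combinations listed in a separate key frame (as the question asks), and avoids the row-by-row loop. The helper `impute_growth` and its parameter `keys` are hypothetical names. The sketch also measures the offset in the Year column rather than in row position, which is an assumption that gives the same result only when each group has one row per consecutive year.

```python
import numpy as np
import pandas as pd


def impute_growth(df, keys, gr=0.05):
    """Return a copy of df with an 'imputed' column filled per (Country, Code).

    Hypothetical helper: assumes one row per year and a single non-NaN
    Value in each group listed in keys.
    """
    out = df.copy()
    group_cols = ['Country', 'Code']

    # Flag rows whose (Country, Code) pair appears in keys.
    flagged = out[group_cols].merge(
        keys[group_cols].drop_duplicates(), on=group_cols,
        how='left', indicator=True)
    wanted = (flagged['_merge'] == 'both').to_numpy()

    # Broadcast each group's non-NaN Value and its Year to every row.
    base_value = out.groupby(group_cols)['Value'].transform('first')  # skips NaN
    anchor_year = (out['Year'].where(out['Value'].notna())
                   .groupby([out[c] for c in group_cols]).transform('first'))

    # Value * (1 + gr) ** (years away from the known value)
    grown = base_value * (1 + gr) ** (out['Year'] - anchor_year)
    out['imputed'] = np.where(wanted, grown, np.nan)
    return out


# Small illustrative frame: XYZ/99 is not listed in keys and is left alone.
years = np.arange(2000, 2010)
df = pd.DataFrame({
    'Country': np.repeat(['ABC', 'DEF', 'XYZ'], len(years)),
    'Code': np.repeat([32, 27, 99], len(years)),
    'Year': np.tile(years, 3),
    'Value': np.nan,
})
df.loc[[4, 14, 24], 'Value'] = [1000000.0, 2000000.0, 500000.0]
keys = pd.DataFrame({'Country': ['ABC', 'DEF'], 'Code': [32, 27]})

result = impute_growth(df, keys)
print(result)
```

Combinations that are absent from `keys` keep NaN in `imputed`, so the main frame never has to be split and reassembled. If the years in a group are not consecutive, the Year-based offset and the index-based offset used in the loop will differ, so pick whichever matches the intended meaning of "annual".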