Pandas DataFrame - fill NaNs of columns based on values of other columns

Time: 2020-07-24 17:58:00

Tags: python pandas dataframe nan

I have a wide dataframe with one column per year:

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
                  data = np.array([[np.nan, np.nan, 16, 17, 18, 19, np.nan],
                                   [16, 17, 18, 19, 20, 21, np.nan],
                                   [np.nan, np.nan, np.nan, np.nan, 16, 17, 31],
                                   [np.nan, 22, 23, 24, np.nan, 26, np.nan],
                                   [36, 36, 37, 38, 39, 40, 55]]))

Year     1991  1992  1993  1994  1995  1996  2010-2012
ID                                                    
29925     NaN   NaN  16.0  17.0  18.0  19.0        NaN
223725   16.0  17.0  18.0  19.0  20.0  21.0        NaN
280165    NaN   NaN   NaN   NaN  16.0  17.0       31.0
813285    NaN  22.0  23.0  24.0   NaN  26.0        NaN
956765   36.0  36.0  37.0  38.0  39.0  40.0       55.0

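The values in each row are one person's age, and each person has a unique ID. I want to fill the NaNs of this dataframe in every year of every row, based on the age values that already exist in that row.

For example, ID 29925 is 16 in 1993, so we know they are 15 in 1992 and 14 in 1991; we therefore want to replace the NaN for 29925 in the 1992 and 1991 columns. Similarly, I want to replace the NaN in the 2010-2012 column for 29925 based on that person's existing age values. Assume that 29925 is 15 years older in the 2010-2012 column. What is the fastest way to do this for the entire dataframe (i.e. for all IDs)?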

1 answer:

Answer 0: (score: 2)


# imports we need later
import numpy as np
import pandas as pd

This is not a particularly efficient method, but it works. I'll leave out your last column (2010-2012) to keep things systematic.

The dataframe df:

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
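                  # the first year is 1991 (not a second 1992), as in the question
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name='Year'),
                  # np.nan instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
                  data = np.array([[np.nan, np.nan, 16, 17, 18, 19],
                                   [16, 17, 18, 19, 20, 21],
                                   [np.nan, np.nan, np.nan, np.nan, 16, 17],
                                   [np.nan, 22, 23, 24, np.nan, 26],
                                   [35, 36, 37, 38, 39, 40]]))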


Compute each person's date of birth (the column year minus the age in that year):

dob=[]
for irow, row in enumerate(df.iterrows()):
    dob.append(np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]))

or, if you're into list comprehensions:

dob = [np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]) for irow, row in enumerate(df.iterrows())]

Now dob looks like this:

[array([  nan,   nan, 1977., 1977., 1977., 1977.]),
 array([1975., 1975., 1975., 1975., 1975., 1975.]),
 array([  nan,   nan,   nan,   nan, 1979., 1979.]),
 array([  nan, 1970., 1970., 1970.,   nan, 1970.]),
 array([1956., 1956., 1956., 1956., 1956., 1956.])]
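The same year-minus-age table can also be built without a Python loop. The following is only a sketch, assuming every column label is an integer year (which holds once the 2010-2012 column is left out); the names years and dob_matrix are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
    columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name='Year'),
    data=np.array([[np.nan, np.nan, 16, 17, 18, 19],
                   [16, 17, 18, 19, 20, 21],
                   [np.nan, np.nan, np.nan, np.nan, 16, 17],
                   [np.nan, 22, 23, 24, np.nan, 26],
                   [35, 36, 37, 38, 39, 40]]))

# Column labels as a 1-D float array of years (assumption: all labels are numeric).
years = df.columns.to_numpy(dtype=float)

# Broadcasting subtracts each row of ages from the years.
# Result has shape (n_people, n_years) and is NaN wherever the age is unknown.
dob_matrix = years - df.to_numpy()
```

Each row of dob_matrix corresponds to one array of the dob list above.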

Make a simple list of birth years using np.unique, after removing the nans:

dob_filtered=[np.unique(each[~np.isnan(each)])[0] for each in dob]

dob_filtered now looks like this:

[1977.0, 1975.0, 1979.0, 1970.0, 1956.0]
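Note that np.unique(...)[0] silently picks the smallest value if a row implies more than one birth year. As a hedged sketch, the reduction can be combined with a consistency check using np.nanmin and np.nanmax. The birth-year arrays are hard-coded below so the snippet runs on its own, and the names stacked and inconsistent are illustrative.

```python
import numpy as np

# Year-minus-age arrays for the five people (NaN where the age is unknown).
dob = [np.array([np.nan, np.nan, 1977., 1977., 1977., 1977.]),
       np.array([1975., 1975., 1975., 1975., 1975., 1975.]),
       np.array([np.nan, np.nan, np.nan, np.nan, 1979., 1979.]),
       np.array([np.nan, 1970., 1970., 1970., np.nan, 1970.]),
       np.array([1956., 1956., 1956., 1956., 1956., 1956.])]

stacked = np.vstack(dob)               # shape (n_people, n_years)
dob_min = np.nanmin(stacked, axis=1)   # smallest implied birth year per person
dob_max = np.nanmax(stacked, axis=1)   # largest implied birth year per person

# True for any person whose known ages disagree about the birth year.
inconsistent = dob_min != dob_max

dob_filtered = dob_min.tolist()
```

A row that is entirely NaN would make np.nanmin emit a RuntimeWarning and return NaN, so such rows would need separate handling.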

Attach this list to the dataframe:

df['dob']=dob_filtered

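Fill the NaNs of df using the dob column:

for irow, row in enumerate(df.index):
    # every year column; only the trailing dob column is skipped
    for icol, col in enumerate(df.columns[:-1]):
        df.loc[row, col] = col - df.loc[row, 'dob']

Delete the dob column (only to get back the original columns; otherwise not important):

# drop returns a new dataframe, so assign the result back
df = df.drop(['dob'], axis=1)

This gives the dataframe with every NaN replaced by the column year minus that person's dob.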
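Since the question asks for the fastest approach, here is a loop-free sketch of the same idea. It is not the answer's original code: it assumes the column labels are integer years and that the known ages in each row agree on a single birth year, and the names birth_year, implied_age, and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
    columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name='Year'),
    data=np.array([[np.nan, np.nan, 16, 17, 18, 19],
                   [16, 17, 18, 19, 20, 21],
                   [np.nan, np.nan, np.nan, np.nan, 16, 17],
                   [np.nan, 22, 23, 24, np.nan, 26],
                   [35, 36, 37, 38, 39, 40]]))

years = df.columns.to_numpy(dtype=float)

# One birth year per person: smallest non-NaN value of (year - age) in each row.
birth_year = np.nanmin(years - df.to_numpy(), axis=1)

# Age implied by the birth year for every (person, year) cell.
implied_age = pd.DataFrame(years[None, :] - birth_year[:, None],
                           index=df.index, columns=df.columns)

# Keep the observed ages and fill only the cells that were NaN.
filled = df.fillna(implied_age)
```

Unlike the nested loop, this only writes to cells that were missing, so observed ages are never overwritten.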