Iterrows error when applying a function to a GroupBy pandas DataFrame

Time: 2019-07-14 20:01:01

Tags: python pandas pandas-groupby

I am working with a pandas DataFrame like this:

ID  have        time
1   NaN     2010-07-01
1   1       2010-07-08
1   5       2011-07-08
1   NaN     2011-08-08
1   NaN     2012-05-08
1   NaN     2012-09-08
1   1       2012-10-08
2   NaN     2013-01-18
2   1       2013-02-18
2   NaN     2013-03-18

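I want to fill in missing values within each ID group (individual), replacing a missing record with that individual's most recent non-missing value only if that value was recorded within the past year: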

ID    have  want    time
1     NaN   NaN     2010-07-01
1     1     1       2010-07-08
1     5     5       2011-07-08
1     NaN   5       2011-08-08
1     NaN   5       2012-05-08
1     NaN   NaN     2012-09-08
1     1     1       2012-10-08
2     NaN   NaN     2013-01-18
2     1     1       2013-02-18
2     NaN   1       2013-03-18

Is there an efficient way to do this?

I am using the following code, which seems to work row by row:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, np.nan, np.nan, "2010-07-01"],
    [1.0,"1",  "1", "2010-07-08"],
    [1.0,"5",  "5", "2011-07-08"],
    [1.0,np.nan, "5", "2011-08-08"],
    [1.0, np.nan, "5", "2012-05-08"],
    [1.0, np.nan,np.nan,  "2012-09-08"],
    [1.0,"1",   "1",  "2012-10-08"],
    [2.0, np.nan, np.nan, "2013-01-18"],
    [2.0, "1",    "1", "2013-02-18"],
    [2.0, np.nan, "1", "2013-03-18"]
    ], columns = ['ID', 'have', 'want', 'time'])
df['time']=pd.to_datetime(df['time'], format='%Y-%m-%d')

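def want(df):
    # NOTE: the original post used the column names 'edatum' and 'dosage';
    # they are assumed here to correspond to 'time' and 'have' in the sample frame.
    for ind, row in df.iterrows():
        # Start by treating the row's own date as the last observed date.
        df.loc[ind, 'ewant'] = df.loc[ind, 'time']
        # If the value is missing, carry the previous row's last observed date
        # forward (this relies on a default 0..n-1 integer index).
        if ind != 0 and pd.isnull(df.loc[ind, 'have']):
            df.loc[ind, 'ewant'] = df.loc[ind - 1, 'ewant']
        # Days elapsed since the last observed value.
        df.loc[ind, 'timespan'] = (df.loc[ind, 'time'] - df.loc[ind, 'ewant']).days
        # Flag rows whose last observed value is between 1 and 365 days old.
        df.loc[ind, 'impu'] = np.where(0 < (df.loc[ind, 'time'] - df.loc[ind, 'ewant']).days <= 365, 1, 0)

    return df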

want(df)

But when I try to apply it at the 'ID' group level:

want(df.groupby(['ID']))

I get this iterrows error:

AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method

Is there a way to fix this iterrows error? Thanks!

1 Answer:

Answer 0 (score: 0)

This is a perfect use case for merge_asof:

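# Assumes df holds only the columns ID, have and time, and is sorted by time.
df1 = df.dropna(subset=['have'])  # rows that carry an observed value
# Backward as-of join within each ID. A 365-day tolerance stands in for "one year";
# pd.Timedelta(12, unit='M') is rejected by recent pandas versions.
df = pd.merge_asof(df, df1, by='ID', on='time', tolerance=pd.Timedelta(days=365))
df  # have_y is the column you want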
   ID  have_x       time  have_y
0   1     NaN 2010-07-01     NaN
1   1     1.0 2010-07-08     1.0
2   1     5.0 2011-07-08     5.0
3   1     NaN 2011-08-08     5.0
4   1     NaN 2012-05-08     5.0
5   1     NaN 2012-09-08     NaN
6   1     1.0 2012-10-08     1.0
7   2     NaN 2013-01-18     NaN
8   2     1.0 2013-02-18     1.0
9   2     NaN 2013-03-18     1.0
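
As for the error itself: df.groupby(['ID']) returns a DataFrameGroupBy object rather than a DataFrame, so it has no iterrows method. As an alternative to the merge, here is a minimal sketch that stays with groupby and avoids iterating over rows. It assumes numeric have values and treats "one year" as 365 days; the helper name fill_within_one_year and its max_age parameter are made up for illustration and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with numeric 'have' values (an assumption).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "have": [np.nan, 1, 5, np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan],
    "time": pd.to_datetime([
        "2010-07-01", "2010-07-08", "2011-07-08", "2011-08-08", "2012-05-08",
        "2012-09-08", "2012-10-08", "2013-01-18", "2013-02-18", "2013-03-18",
    ]),
})


def fill_within_one_year(df, max_age=pd.Timedelta(days=365)):
    """Hypothetical helper: forward-fill 'have' per ID, limited to max_age."""
    df = df.sort_values(["ID", "time"])
    # Date of the most recent observed value within each ID
    # (NaT for rows that precede the first observation).
    last_seen = df["time"].where(df["have"].notna()).groupby(df["ID"]).ffill()
    # Most recent observed value within each ID.
    filled = df.groupby("ID")["have"].ffill()
    # Keep the carried-forward value only if it is recent enough;
    # comparisons against NaT evaluate to False, so those rows stay NaN.
    recent = (df["time"] - last_seen) <= max_age
    return df.assign(want=filled.where(recent))


result = fill_within_one_year(df)
print(result)
```

On the sample data this reproduces the want column shown in the question. The group-wise ffill calls replace the row loop, so no apply or iterrows is needed.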