Filling a DataFrame with NaN when several days of data are missing

Asked: 2017-09-19 17:41:52

Tags: python pandas group-by interpolation pandas-groupby

I have a pandas DataFrame that I interpolate to obtain a daily DataFrame. The original DataFrame looks like this:

               col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-12  0.003750  0.117274 
2017-10-14  0.000000  0.161556 
2017-10-17  0.000000  0.116264   

In the interpolated DataFrame, I want to change the values to NaN wherever the gap between dates exceeds 5 days. For example, in the DataFrame above, the gap between 2017-10-02 and 2017-10-12 is more than 5 days, so all interpolated values between those two dates should be dropped. I don't know how to do this. Maybe combine_first?
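As background on combine_first (mentioned above): it patches the NaN holes of the calling object with values from another object, keeping existing values. A tiny sketch on toy Series, not the frames above:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([10.0, 20.0, 30.0])

# NaNs in s1 are filled from s2; existing values in s1 win
print(s1.combine_first(s2).tolist())  # [1.0, 20.0, 3.0]
```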

Edit: the interpolated DataFrame looks like this:

            col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-03  0.015804  0.113309 
2017-10-04  0.014464  0.113750 
2017-10-05  0.013125  0.114190 
2017-10-06  0.011786  0.114631 
2017-10-07  0.010446  0.115071 
2017-10-08  0.009107  0.115512 
2017-10-09  0.007768  0.115953 
2017-10-10  0.006429  0.116393 
2017-10-11  0.005089  0.116834 
2017-10-12  0.003750  0.117274 
2017-10-13  0.001875  0.139415 
2017-10-14  0.000000  0.161556 
2017-10-15  0.000000  0.146459 
2017-10-16  0.000000  0.131361 
2017-10-17  0.000000  0.116264

Expected output:

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264

4 answers:

Answer 0 (score: 10)

I would start by identifying where the gaps exceed 5 days. From there, I generate an array that labels the groups between those gaps. Finally, I use groupby to move to a daily frequency and interpolate within each group.

import numpy as np
import pandas as pd

# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'

# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)

# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)

# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()

# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()

# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
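To see how the grouping array is built, here is the same diff/cumsum trick applied to just the five dates from the question (reproduced for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                      '2017-10-14', '2017-10-17'])

daytype = 'timedelta64[D]'
deltas = np.diff(idx.values).astype(daytype)   # gaps of 1, 10, 2, 3 days

# a new group label starts right after each gap larger than 5 days
groups = np.append(False, deltas > np.array(5, dtype=daytype)).cumsum()
print(groups.tolist())  # [0, 0, 1, 1, 1]
```

Rows 2017-10-01 and 2017-10-02 fall in group 0, the rest in group 1, so interpolation never bridges the 10-day gap.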

If you want to supply your own interpolation method, you can modify the above like this:

daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()

# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()

to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)

df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
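As one example of what my_interpolate could be swapped for (my choice, not part of the answer above), pandas supports time-weighted interpolation, which weights by the actual timestamps rather than row position:

```python
import pandas as pd

idx = pd.to_datetime(['2017-10-01', '2017-10-04'])
s = pd.Series([0.0, 3.0], index=idx).asfreq('D')

# method='time' fills NaNs proportionally to elapsed time
print(s.interpolate(method='time').tolist())  # [0.0, 1.0, 2.0, 3.0]
```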

Answer 1 (score: 1)

Is this what you want?

import numpy as np
import pandas as pd

data0 = """2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-12  0.003750  0.117274 
2017-10-14  0.000000  0.161556 
2017-10-17  0.000000  0.116264"""
data = [row.split('  ') for row in data0.split('\n')]

df = pd.DataFrame(data, columns = ['date','col_1','vals'])
df.date = pd.to_datetime(df.date)
last_observation = df.assign(last_observation = df.date.diff().dt.days)
df.set_index(['date'], inplace = True)

all_dates = pd.date_range(start = last_observation.date.min(), 
                          end = last_observation.date.max())
df_interpolated = df.reindex(all_dates).astype(np.float64).interpolate()
df_interpolated = df_interpolated.join(last_observation.set_index('date').last_observation)
df_interpolated['discard'] = (df_interpolated.last_observation.bfill() > 5) & df_interpolated.last_observation.isnull()
df_interpolated[['col_1','vals']] = df_interpolated[['col_1','vals']].where(~df_interpolated.discard)

The output is:

               col_1      vals  last_observation  discard
2017-10-01  0.000000  0.112869               NaN    False
2017-10-02  0.017143  0.112869               1.0    False
2017-10-03       NaN       NaN               NaN     True
2017-10-04       NaN       NaN               NaN     True
2017-10-05       NaN       NaN               NaN     True
2017-10-06       NaN       NaN               NaN     True
2017-10-07       NaN       NaN               NaN     True
2017-10-08       NaN       NaN               NaN     True
2017-10-09       NaN       NaN               NaN     True
2017-10-10       NaN       NaN               NaN     True
2017-10-11       NaN       NaN               NaN     True
2017-10-12  0.003750  0.117274              10.0    False
2017-10-13  0.001875  0.139415               NaN    False
2017-10-14  0.000000  0.161556               2.0    False
2017-10-15  0.000000  0.146459               NaN    False
2017-10-16  0.000000  0.131361               NaN    False
2017-10-17  0.000000  0.116264               3.0    False

The idea is that you first generate the interpolation (as you did) and then decide which observations to discard. Start by assigning, for each original observation, the number of days since the previous one. Since you want to discard the interpolated entries that sit inside a gap larger than 5 days, use .bfill() to propagate that gap size backwards onto the preceding interpolated rows before comparing against 5. Note, however, that this comparison alone would also discard the real observation that closes the gap, which you don't want. So you add the condition that only interpolated rows may be discarded, which is checked with .isnull() on the last_observation column.

Finally, the .where method keeps the entries that do not meet the discard criterion; by default, the others are replaced with NaN.
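The two building blocks behave as follows on toy Series (illustration only):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 10.0])
# bfill propagates the next valid value backwards over the NaNs
print(s.bfill().tolist())        # [10.0, 10.0, 10.0]

t = pd.Series([1.0, 2.0, 3.0])
# where keeps entries satisfying the condition; the rest become NaN
print(t.where(t > 1).tolist())   # [nan, 2.0, 3.0]
```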

Answer 2 (score: 1)

If I understand correctly, you can drop the unwanted rows with boolean indexing. Assuming the day differences are in a column named diff, you can use df.loc[df['diff'].dt.days < 5].

Here is a demo:

df = pd.read_clipboard()

               col_1    vals
2017-10-01  0.000000    0.112869
2017-10-02  0.017143    0.112869
2017-10-12  0.003750    0.117274
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264

Convert the index to a datetime column and add a new column with the difference, in days, from the previous value:

df = df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['diff'] = df['index'] - df['index'].shift(1)


       index    col_1       vals       diff
0   2017-10-01  0.000000    0.112869    NaT
1   2017-10-02  0.017143    0.112869    1 days
2   2017-10-12  0.003750    0.117274    10 days
3   2017-10-14  0.000000    0.161556    2 days
4   2017-10-17  0.000000    0.116264    3 days

Add a boolean filter:

new_df = df.loc[df['diff'].dt.days < 5]
new_df = new_df.drop('diff', axis=1)
new_df.set_index('index', inplace=True)
new_df

               col_1    vals
index       
2017-10-02  0.017143    0.112869
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264
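Note that the leading NaT makes dt.days NaN in the first row, so the filter above also drops 2017-10-01, which the expected output keeps. One variation (mine, not from the answer) is to fill the first gap with zero and compare against a Timedelta directly:

```python
import pandas as pd

dates = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                        '2017-10-14', '2017-10-17'])
df = pd.DataFrame({'index': dates,
                   'col_1': [0.0, 0.017143, 0.00375, 0.0, 0.0]})

# fill the leading NaT with 0 days so the first row survives the filter
gap = df['index'].diff().fillna(pd.Timedelta(0))
kept = df.loc[gap < pd.Timedelta(days=5)]
print(kept['index'].dt.strftime('%Y-%m-%d').tolist())
# ['2017-10-01', '2017-10-02', '2017-10-14', '2017-10-17']
```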

Answer 3 (score: 1)

I added a few rows to your example so that there are two blocks where rows are spaced more than 5 days apart. I saved the two tables locally as .csv files and added date as the first column name, to enable the merge below:

Setup

import pandas as pd
import numpy as np
df_1=pd.read_csv('df_1.csv', delimiter=r"\s+")
df_2=pd.read_csv('df_2.csv', delimiter=r"\s+")

Merge (join) the two datasets and rename the columns.
Note that the two sets contain two gaps of more than 5 days.

df=df_2.merge(df_1, how='left', on='Date').reset_index(drop=True)
df.columns=['date','col','val','col_na','val_na']    #purely aesthetic

df

    date        col         val         col_na      val_na
0   2017-10-01  0.000000    0.112869    0.000000    0.112869
1   2017-10-02  0.017143    0.112869    0.017143    0.112869
2   2017-10-03  0.015804    0.113309    NaN         NaN
3   2017-10-04  0.014464    0.113750    NaN         NaN
4   2017-10-05  0.013125    0.114190    NaN         NaN
5   2017-10-06  0.011786    0.114631    NaN         NaN
6   2017-10-07  0.010446    0.115071    NaN         NaN
7   2017-10-08  0.009107    0.115512    NaN         NaN
8   2017-10-09  0.007768    0.115953    NaN         NaN
9   2017-10-10  0.006429    0.116393    NaN         NaN
10  2017-10-11  0.005089    0.116834    NaN         NaN
11  2017-10-12  0.003750    0.117274    0.003750    0.117274
12  2017-10-13  0.001875    0.139415    NaN         NaN
13  2017-10-14  0.000000    0.161556    0.000000    0.161556
14  2017-10-15  0.000000    0.146459    NaN         NaN
15  2017-10-16  0.000000    0.131361    NaN         NaN
16  2017-10-17  0.000000    0.989999    0.000000    0.116264
17  2017-10-18  0.000000    0.412311    NaN         NaN
18  2017-10-19  0.000000    0.166264    NaN         NaN
19  2017-10-20  0.000000    0.123464    NaN         NaN
20  2017-10-21  0.000000    0.149767    NaN         NaN
21  2017-10-22  0.000000    0.376455    NaN         NaN
22  2017-10-23  0.000000    0.000215    NaN         NaN
23  2017-10-24  0.000000    0.940219    NaN         NaN
24  2017-10-25  0.000000    0.030352    0.000000    0.030352
25  2017-10-26  0.000000    0.111112    NaN         NaN
26  2017-10-27  0.000000    0.002500    NaN         NaN

Function that performs the task:

def my_func(my_df):
    non_na_index = []                            # define empty list
    for i in range(len(my_df)):
        if not pd.isnull(my_df.iloc[i, 3]):
            non_na_index.append(i)               # add indexes of rows that have a non-NaN value
    sub = np.roll(non_na_index, shift=-1) - non_na_index  # subtract indexes to get the NaN run lengths
    sub = sub[:-1]                               # get rid of the last element (calculation artifact)
    for i in reversed(range(len(sub))):
        if sub[i] >= 5:                          # identify index gaps of 5 or more rows
            b = non_na_index[i + 1]              # assign end index
            a = non_na_index[i] + 1              # assign start index
            my_df = my_df.drop(my_df.index[a:b]) # drop the rows within the range
    return my_df
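The np.roll step computes, for each non-NaN row, the distance to the next non-NaN row; a quick sketch with hypothetical row positions:

```python
import numpy as np

non_na_index = [0, 1, 11, 13, 16]   # hypothetical positions of rows holding data
sub = np.roll(non_na_index, shift=-1) - non_na_index
sub = sub[:-1]                      # last entry wraps around, so it is discarded
print(sub.tolist())  # [1, 10, 2, 3]: gaps between consecutive non-NaN rows
```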

Run the function on df:

new_df = my_func(df)
new_df = new_df.drop(['col_na','val_na'], axis=1)    # drop the two extra columns
new_df

    date        col         val
0   2017-10-01  0.000000    0.112869
1   2017-10-02  0.017143    0.112869
11  2017-10-12  0.003750    0.117274
12  2017-10-13  0.001875    0.139415
13  2017-10-14  0.000000    0.161556
14  2017-10-15  0.000000    0.146459
15  2017-10-16  0.000000    0.131361
16  2017-10-17  0.000000    0.989999
24  2017-10-25  0.000000    0.030352
25  2017-10-26  0.000000    0.111112
26  2017-10-27  0.000000    0.002500