I have a pandas DataFrame that I interpolate to get a daily DataFrame. The original DataFrame looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
In the interpolated DataFrame, I want to change the values to NaN wherever the date gap exceeds 5 days. E.g. in the DataFrame above, the gap between 2017-10-02 and 2017-10-12 is more than 5 days, so all interpolated values between those two dates should be removed. I don't know how to do this; maybe combine_first?
Edit: the interpolated DataFrame looks like this:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-03 0.015804 0.113309
2017-10-04 0.014464 0.113750
2017-10-05 0.013125 0.114190
2017-10-06 0.011786 0.114631
2017-10-07 0.010446 0.115071
2017-10-08 0.009107 0.115512
2017-10-09 0.007768 0.115953
2017-10-10 0.006429 0.116393
2017-10-11 0.005089 0.116834
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Expected output:
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
Answer 0 (Score: 10)
I'd start by identifying where the gaps are greater than 5 days. From there, I generate an array that identifies the groups between those gaps. Finally, I use groupby to go to a daily frequency and interpolate within each group.
import numpy as np

# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'
# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)
# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)
# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()
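# with the sample data, deltas is [1, 10, 2, 3] days, so groups
# becomes array([0, 0, 1, 1, 1]): one group label per contiguous block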
# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()
# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
If you want to provide your own interpolation method, you can modify the above like this:
daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()
# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()
to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)
df.groupby(groups, group_keys=False).apply(to_daily)
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-13 0.001875 0.139415
2017-10-14 0.000000 0.161556
2017-10-15 0.000000 0.146459
2017-10-16 0.000000 0.131361
2017-10-17 0.000000 0.116264
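For instance, if you wanted time-weighted rather than linear interpolation (just one illustration; any interpolation method pandas supports would work the same way), my_interpolate could be:
def my_interpolate(df):
    # weight by actual time distance between points; requires a DatetimeIndex
    return df.interpolate(method='time')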
Answer 1 (Score: 1)
Is this what you want?
import numpy as np
import pandas as pd

data0 = """2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264"""
data = [row.split() for row in data0.split('\n')]
df = pd.DataFrame(data, columns = ['date','col_1','vals'])
df.date = pd.to_datetime(df.date)
last_observation = df.assign(last_observation = df.date.diff().dt.days)
df.set_index(['date'], inplace = True)
all_dates = pd.date_range(start = last_observation.date.min(),
                          end = last_observation.date.max())
df_interpolated = df.reindex(all_dates).astype(np.float64).interpolate()
df_interpolated = df_interpolated.join(last_observation.set_index('date').last_observation)
df_interpolated['discard'] = (df_interpolated.last_observation.bfill() > 5) & df_interpolated.last_observation.isnull()
df_interpolated[['col_1','vals']] = df_interpolated[['col_1','vals']].where(~df_interpolated.discard)
The output is:
col_1 vals last_observation discard
2017-10-01 0.000000 0.112869 NaN False
2017-10-02 0.017143 0.112869 1.0 False
2017-10-03 NaN NaN NaN True
2017-10-04 NaN NaN NaN True
2017-10-05 NaN NaN NaN True
2017-10-06 NaN NaN NaN True
2017-10-07 NaN NaN NaN True
2017-10-08 NaN NaN NaN True
2017-10-09 NaN NaN NaN True
2017-10-10 NaN NaN NaN True
2017-10-11 NaN NaN NaN True
2017-10-12 0.003750 0.117274 10.0 False
2017-10-13 0.001875 0.139415 NaN False
2017-10-14 0.000000 0.161556 2.0 False
2017-10-15 0.000000 0.146459 NaN False
2017-10-16 0.000000 0.131361 NaN False
2017-10-17 0.000000 0.116264 3.0 False
The idea is that you first generate the interpolation (as you did) and then decide which observations to discard. Start by recording the number of days between each observation and the previous one. Since you want to discard the entries where this number exceeds 5, along with the interpolated entries before them, use .bfill to propagate this number back onto the preceding interpolated rows before comparing. Note, however, that the observations themselves would then also test positive for discarding, which you don't want; so add the condition that actual observations are never discarded, which you can check with the .notnull() method on the last_observation column.
Finally, use the .where method to keep the entries that don't meet the discard criterion; by default, the others are replaced with NaN.
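To see the discard logic in isolation, here is a minimal sketch on a toy gap column (the values are invented for illustration):
import pandas as pd

# NaN marks interpolated rows; numbers mark observations with the
# gap (in days) to the previous observation
gap = pd.Series([None, 1, None, None, 10, 2], dtype='float64')
filled = gap.bfill()                   # [1, 1, 10, 10, 10, 2]
discard = (filled > 5) & gap.isnull()  # only interpolated rows inside a big gap
print(discard.tolist())                # [False, False, True, True, False, False]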
Answer 2 (Score: 1)
If I understand correctly, you can drop the unnecessary rows with boolean indexing. Assuming the day differences are in a column named diff, you can use df.loc[df['diff'].dt.days < 5].
Here is a demo:
import pandas as pd

df = pd.read_clipboard()
col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
Convert the index to a datetime column and add a new column with the difference, in days, from the previous row:
df = df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['diff'] = df['index'] - df['index'].shift(1)
index col_1 vals diff
0 2017-10-01 0.000000 0.112869 NaT
1 2017-10-02 0.017143 0.112869 1 days
2 2017-10-12 0.003750 0.117274 10 days
3 2017-10-14 0.000000 0.161556 2 days
4 2017-10-17 0.000000 0.116264 3 days
Apply the boolean filter:
new_df = df.loc[df['diff'].dt.days < 5]
new_df = new_df.drop('diff', axis=1)
new_df.set_index('index', inplace=True)
new_df
col_1 vals
index
2017-10-02 0.017143 0.112869
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
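Note that the first row is dropped as well, because its diff is NaT and the comparison NaT < 5 evaluates to False. If you want to keep rows with no previous observation, a small tweak (sketch) is:
new_df = df.loc[(df['diff'].dt.days < 5) | df['diff'].isnull()]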
Answer 3 (Score: 1)
I added a few rows to your example so that there are two blocks of rows spaced more than 5 days apart. I saved the two tables locally as .csv files and added date as the first column name to make the merge below work:
Setup
import pandas as pd
import numpy as np
df_1=pd.read_csv('df_1.csv', delimiter=r"\s+")
df_2=pd.read_csv('df_2.csv', delimiter=r"\s+")
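If you don't want to save files to disk, here is a minimal sketch that loads the same data inline with io.StringIO (the column names Date, col_1 and vals are assumptions; df_1 is reconstructed from the non-NaN rows of the merged output below, and df_2.csv follows the same layout with the daily values shown in the col and val columns):
import io

df_1 = pd.read_csv(io.StringIO("""\
Date col_1 vals
2017-10-01 0.000000 0.112869
2017-10-02 0.017143 0.112869
2017-10-12 0.003750 0.117274
2017-10-14 0.000000 0.161556
2017-10-17 0.000000 0.116264
2017-10-25 0.000000 0.030352
"""), delimiter=r"\s+")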
Merge (join) the two datasets and rename the columns. Note the two groups separated by more than 5 days:
df=df_2.merge(df_1, how='left', on='Date').reset_index(drop=True)
df.columns=['date','col','val','col_na','val_na'] #purely aesthetic
df
date col val col_na val_na
0 2017-10-01 0.000000 0.112869 0.000000 0.112869
1 2017-10-02 0.017143 0.112869 0.017143 0.112869
2 2017-10-03 0.015804 0.113309 NaN NaN
3 2017-10-04 0.014464 0.113750 NaN NaN
4 2017-10-05 0.013125 0.114190 NaN NaN
5 2017-10-06 0.011786 0.114631 NaN NaN
6 2017-10-07 0.010446 0.115071 NaN NaN
7 2017-10-08 0.009107 0.115512 NaN NaN
8 2017-10-09 0.007768 0.115953 NaN NaN
9 2017-10-10 0.006429 0.116393 NaN NaN
10 2017-10-11 0.005089 0.116834 NaN NaN
11 2017-10-12 0.003750 0.117274 0.003750 0.117274
12 2017-10-13 0.001875 0.139415 NaN NaN
13 2017-10-14 0.000000 0.161556 0.000000 0.161556
14 2017-10-15 0.000000 0.146459 NaN NaN
15 2017-10-16 0.000000 0.131361 NaN NaN
16 2017-10-17 0.000000 0.989999 0.000000 0.116264
17 2017-10-18 0.000000 0.412311 NaN NaN
18 2017-10-19 0.000000 0.166264 NaN NaN
19 2017-10-20 0.000000 0.123464 NaN NaN
20 2017-10-21 0.000000 0.149767 NaN NaN
21 2017-10-22 0.000000 0.376455 NaN NaN
22 2017-10-23 0.000000 0.000215 NaN NaN
23 2017-10-24 0.000000 0.940219 NaN NaN
24 2017-10-25 0.000000 0.030352 0.000000 0.030352
25 2017-10-26 0.000000 0.111112 NaN NaN
26 2017-10-27 0.000000 0.002500 NaN NaN
A function to perform the task:
def my_func(my_df):
    non_na_index = []  # indexes of rows whose col_na value is not NaN
    for i in range(len(my_df)):
        if not pd.isnull(my_df.iloc[i, 3]):
            non_na_index.append(i)
    # subtract consecutive indexes to find the number of rows between observations
    sub = np.roll(non_na_index, shift=-1) - non_na_index
    sub = sub[:-1]  # get rid of the last element (wrap-around artifact of np.roll)
    for i in reversed(range(len(sub))):
        if sub[i] >= 5:  # an index jump of 5+ rows marks a gap to discard
            b = non_na_index[i + 1]  # assign end index (next observation)
            a = non_na_index[i] + 1  # assign start index (first NaN row after an observation)
            my_df = my_df.drop(my_df.index[a:b])  # drop the rows within the range
    return my_df
Apply it to df:
new_df = my_func(df)
new_df = new_df.drop(['col_na', 'val_na'], axis=1)  # drop the two extra columns
new_df
date col val
0 2017-10-01 0.000000 0.112869
1 2017-10-02 0.017143 0.112869
11 2017-10-12 0.003750 0.117274
12 2017-10-13 0.001875 0.139415
13 2017-10-14 0.000000 0.161556
14 2017-10-15 0.000000 0.146459
15 2017-10-16 0.000000 0.131361
16 2017-10-17 0.000000 0.989999
24 2017-10-25 0.000000 0.030352
25 2017-10-26 0.000000 0.111112
26 2017-10-27 0.000000 0.002500