I have a dataframe containing financial data sampled at 1-minute intervals. Occasionally a row or two of data is missing.
#Example Input---------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
# Desired Output-------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
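My current approach is based on this post, "Find missing minute data in time series data using pandas", which only suggests how to identify the gaps, not how to fill them.
What I am doing is creating a DatetimeIndex at 1-minute intervals. Using that index, I create an entirely new dataframe, which I then merge with my original dataframe to fill the gaps. The code is shown below. There seem to be many ways of doing this, and I am wondering whether there is a better one. Perhaps by resampling the data?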
import pandas as pd
from datetime import datetime
# Initialise prices dataframe with missing data
prices = pd.DataFrame([[datetime(2019,2,7,16,0), 124.634, 124.624, 124.65, 124.62],[datetime(2019,2,7,16,4), 124.624, 124.627, 124.647, 124.617]])
prices.columns = ['datetime','open','high','low','close']
prices = prices.set_index('datetime')
print(prices)
# Create a new dataframe with complete set of time intervals
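# pd.date_range replaces the DatetimeIndex(start=..., end=...) constructor form removed in pandas 1.0
idx_ref = pd.date_range(start=datetime(2019,2,7,16,0), end=datetime(2019,2,7,16,4), freq='min')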
df = pd.DataFrame(index=idx_ref)
# Merge the two dataframes
prices = pd.merge(df, prices, how='outer', left_index=True,
right_index=True)
print(prices)
Answer 0: (score: 3)
Use DataFrame.asfreq, which works with a DatetimeIndex.
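The code for this answer did not survive in the text above, so the following is only a minimal sketch of how asfreq might be applied here. The sample frame is rebuilt from the question's example input, and the name `filled` is an assumption made for illustration.

```python
import pandas as pd

# Sample data with the 16:02 and 16:03 rows missing (values from the question's example)
prices = pd.DataFrame(
    {
        "open": [124.624, 124.646],
        "high": [124.627, 124.655],
        "low": [124.647, 124.664],
        "close": [124.617, 124.645],
    },
    index=pd.to_datetime(["2019-02-07 16:01:00", "2019-02-07 16:04:00"]),
)

# asfreq conforms the frame to a regular 1-minute index spanning the first
# to the last timestamp; rows that did not exist before are filled with NaN
filled = prices.asfreq("1min")
print(filled)
```

This only works as intended when the index is already a DatetimeIndex; the existing rows keep their values and only the newly created rows are empty.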
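Answer 1: (score: 0)
@jezrael's proposal did not work for me at first, because my index was of a different type than DatetimeIndex. Running prices.asfreq() wiped out all of the prices data, although it did fill the gaps with NaN: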
open high low close
datetime
2019-02-07 16:00:00 NaN NaN NaN NaN
2019-02-07 16:01:00 NaN NaN NaN NaN
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 NaN NaN NaN NaN
To fix this, I had to change the type of the index column like this:
prices['date'] = pd.to_datetime(prices['datetime'])
prices = prices.set_index('date')
prices.drop(['datetime'], axis=1, inplace=True)
This code converts the 'datetime' column to datetime values and sets the new column as the index, which makes the index a DatetimeIndex.
Now I can call:
prices = prices.asfreq('1Min')
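If the timestamps already sit in the index as plain strings rather than in a column, a shorter variant might be to convert the index in place before calling asfreq. This is a sketch under that assumption; the sample frame and the name `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index holds the timestamps as plain strings
prices = pd.DataFrame(
    {"open": [124.634, 124.624], "close": [124.62, 124.617]},
    index=["2019-02-07 16:00:00", "2019-02-07 16:04:00"],
)

# Convert the existing index to a DatetimeIndex so the original rows
# line up with the regular 1-minute grid that asfreq generates
prices.index = pd.to_datetime(prices.index)
prices.index.name = "date"

filled = prices.asfreq("1min")
print(filled)
```

After the conversion the two original rows keep their values and only the three minutes in between are NaN.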
Answer 2: (score: 0)
A more manual answer is:
from datetime import datetime, timedelta
from dateutil import parser
import pandas as pd
df = pd.DataFrame({
'a': ['2021-02-07 11:00:30', '2021-02-07 11:00:31', '2021-02-07 11:00:35'],
'b': [64.8, 64.8, 50.3]
})
max_dt = parser.parse(max(df['a']))
min_dt = parser.parse(min(df['a']))
dt_range = []
while min_dt <= max_dt:
dt_range.append(min_dt.strftime("%Y-%m-%d %H:%M:%S"))
min_dt += timedelta(seconds=1)
complete_df = pd.DataFrame({'a': dt_range})
final_df = complete_df.merge(df, how='left', on='a')
It converts the following dataframe:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:35 50.3
into:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:32 NaN
3 2021-02-07 11:00:33 NaN
4 2021-02-07 11:00:34 NaN
5 2021-02-07 11:00:35 50.3
We can fill in its null values later.
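The answer does not say how the nulls would be filled, so the following is just one possible sketch using two standard pandas options, forward fill and linear interpolation. Which one is appropriate (or whether to fill at all) depends on the data; the names `forward_filled` and `interpolated` are assumptions for illustration.

```python
import pandas as pd

# The gap-expanded frame from above, rebuilt by hand for a self-contained example
final_df = pd.DataFrame({
    "a": [
        "2021-02-07 11:00:30", "2021-02-07 11:00:31", "2021-02-07 11:00:32",
        "2021-02-07 11:00:33", "2021-02-07 11:00:34", "2021-02-07 11:00:35",
    ],
    "b": [64.8, 64.8, float("nan"), float("nan"), float("nan"), 50.3],
})

# Option 1: carry the last known value forward into each gap
forward_filled = final_df["b"].ffill()

# Option 2: interpolate linearly between the neighbouring known values
interpolated = final_df["b"].interpolate()

print(pd.DataFrame({"a": final_df["a"], "ffill": forward_filled, "interp": interpolated}))
```

Forward fill repeats the last observed value, while interpolation invents intermediate values that were never observed, so the choice changes what downstream calculations see.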