我需要重新采样时间序列数据并在一个小时的过程中以15分钟的间隔内插缺失值。每个ID每小时应包含四行数据。
在:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
出局:
ID Time Value
1 2019-01-01 12:00:00 3.0
1 2019-01-01 12:15:00 3.0
1 2019-01-01 12:30:00 2.0
1 2019-01-01 12:45:00 2.0
2 2019-01-01 12:00:00 5.0
2 2019-01-01 12:15:00 7.0
2 2019-01-01 12:30:00 7.0
2 2019-01-01 12:45:00 7.0
我编写了一个函数来执行此操作,但是在尝试处理更大的数据集时,效率急剧下降。
有没有更有效的方法?
import datetime
import pandas as pd
data = pd.DataFrame({'ID': [1,1,2,2],
'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'],
'Value': [3,2,5,7]})
def clean_dataset(data):
ids = data.drop_duplicates(subset='ID')
data['Time'] = pd.to_datetime(data['Time'])
data['Time'] = data['Time'].apply(
lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour,15*(dt.minute // 15)))
data = data.drop_duplicates(subset=['Time','ID']).reset_index(drop=True)
df = pd.DataFrame(columns=['Time','ID','Value'])
for i in range(ids.shape[0]):
times = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
id_data = data[data['ID']==ids.iloc[i]['ID']]
clean_data = times.join(id_data.set_index('Time'), on='Time')
clean_data = clean_data.interpolate(method='linear', limit_direction='both')
clean_data.drop(clean_data.tail(1).index,inplace=True)
df = df.append(clean_data)
return df
clean_dataset(data)
答案 0 :(得分:2)
对于大数据集,线性插值的确会变慢。代码中存在循环也是造成速度下降的主要原因。任何可以从循环中删除并预先计算的内容都将有助于提高效率。例如,如果您预定义用于初始化times
的数据帧,则代码的效率将提高14%:
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
for i in range(ids.shape[0]):
times = times_template.copy()
对代码进行性能分析可确认插值所花费的时间最长(22.7%),其次是联接(13.1%),附加(7.71%),然后是drop(7.67%)命令。
答案 1 :(得分:1)
您可以使用:
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create MultiIndex and reindex
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
#interpolate per groups
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
如果范围无法更改:
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#end in 13:00
rng = pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min')
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
#remove last row per groups
data = data[data['ID'].duplicated(keep='last')]
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
5 2 2019-01-01 12:00:00 5.0
6 2 2019-01-01 12:15:00 7.0
7 2 2019-01-01 12:30:00 7.0
8 2 2019-01-01 12:45:00 7.0
编辑:
另一种使用merge
并左联接reindex
的解决方案:
from itertools import product
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('H') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create helper DataFrame and merge with left join
df = pd.DataFrame(list(product(data['ID'].unique(), rng)), columns=['ID','Time'])
print (df)
ID Time
0 1 2019-01-01 12:00:00
1 1 2019-01-01 12:15:00
2 1 2019-01-01 12:30:00
3 1 2019-01-01 12:45:00
4 2 2019-01-01 12:00:00
5 2 2019-01-01 12:15:00
6 2 2019-01-01 12:30:00
7 2 2019-01-01 12:45:00
data = df.merge(data, how='left')
##interpolate per groups
data['Value'] = (data.groupby('ID')['Value']
.apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0