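I have a large dataset containing multiple groups, with start and end date columns and a value column (each group can have several values). I would like to expand it efficiently into a new dataframe that is indexed by time (one row per second) and has one column per group in which the values are stored.
The data looks like this: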
import pandas as pd
import datetime as dt
import numpy as np
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3,5,22,21), dt.datetime(2017, 4, 5,3,51,22),\
dt.datetime(2017, 4, 4,4,23,33),dt.datetime(2017, 4, 3,7,28,45),\
dt.datetime(2017, 4, 6,5,22,24),dt.datetime(2017, 4, 6,5,22,56)]
df['end'] = [dt.datetime(2017, 4, 3,6,33,23), dt.datetime(2017, 4,5,3,52,46),\
dt.datetime(2017, 4,4,4,58,12),dt.datetime(2017, 4, 4,1,23,34),\
dt.datetime(2017, 4, 7,5,22,24),dt.datetime(2017, 4, 7,5,22,47)]
df['group'] = ['1', '2', '3','1','2','3']
df['value'] = ['a', 'b', 'c','b','c','a']
                start                 end group value
0 2017-04-03 05:22:21 2017-04-03 06:33:23     1     a
1 2017-04-05 03:51:22 2017-04-05 03:52:46     2     b
2 2017-04-04 04:23:33 2017-04-04 04:58:12     3     c
3 2017-04-03 07:28:45 2017-04-04 01:23:34     1     b
4 2017-04-06 05:22:24 2017-04-07 05:22:24     2     c
5 2017-04-06 05:22:56 2017-04-07 05:22:47     3     a
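I tried the following approach:
1. Construct a new dataframe whose index ranges from the earliest start to the latest end.
2. Group by group ID.
3. Iterate over the rows of each group, creating a small dataframe from each row that is indexed from the row's start date to its end date and stores the row's value.
4. Concatenate the small dataframes of the same group into one dataframe.
Here is the code snippet: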
def turn_deltas(row, col):
    # one small frame per interval: a row for every second between start and end
    key = str(row['group'])
    df = pd.DataFrame(index=pd.date_range(row['start'], row['end'], freq="1s"))
    df[key] = row[col]
    return df

grouped = df.groupby("group")
data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq="1s"))
for name, group in grouped:
    for i, row in enumerate(group.iterrows()):
        if i == 0:
            df_2 = turn_deltas(row[1], "value")
        else:
            df_2 = pd.concat([df_2, turn_deltas(row[1], "value")], axis=0)
    data = data.merge(df_2, how="left", left_index=True, right_index=True)
print(data)
My code works, but it is very slow.
In the end, I get this expanded dataframe:
2017-04-03 05:22:21 a NaN NaN
2017-04-03 05:22:22 a NaN NaN
2017-04-03 05:22:23 a NaN NaN
2017-04-03 05:22:24 a NaN NaN
2017-04-03 05:22:25 a NaN NaN
2017-04-03 05:22:26 a NaN NaN
2017-04-03 05:22:27 a NaN NaN
2017-04-03 05:22:28 a NaN NaN
2017-04-03 05:22:29 a NaN NaN
2017-04-03 05:22:30 a NaN NaN
2017-04-03 05:22:31 a NaN NaN
2017-04-03 05:22:32 a NaN NaN
2017-04-03 05:22:33 a NaN NaN
2017-04-03 05:22:34 a NaN NaN
2017-04-03 05:22:35 a NaN NaN
2017-04-03 05:22:36 a NaN NaN
2017-04-03 05:22:37 a NaN NaN
2017-04-03 05:22:38 a NaN NaN
2017-04-03 05:22:39 a NaN NaN
2017-04-03 05:22:40 a NaN NaN
2017-04-03 05:22:41 a NaN NaN
2017-04-03 05:22:42 a NaN NaN
2017-04-03 05:22:43 a NaN NaN
2017-04-03 05:22:44 a NaN NaN
2017-04-03 05:22:45 a NaN NaN
2017-04-03 05:22:46 a NaN NaN
2017-04-03 05:22:47 a NaN NaN
2017-04-03 05:22:48 a NaN NaN
2017-04-03 05:22:49 a NaN NaN
2017-04-03 05:22:50 a NaN NaN
... ... ... ...
2017-04-07 05:22:18 NaN c a
2017-04-07 05:22:19 NaN c a
2017-04-07 05:22:20 NaN c a
2017-04-07 05:22:21 NaN c a
2017-04-07 05:22:22 NaN c a
2017-04-07 05:22:23 NaN c a
2017-04-07 05:22:24 NaN c a
2017-04-07 05:22:25 NaN NaN a
2017-04-07 05:22:26 NaN NaN a
2017-04-07 05:22:27 NaN NaN a
2017-04-07 05:22:28 NaN NaN a
2017-04-07 05:22:29 NaN NaN a
2017-04-07 05:22:30 NaN NaN a
2017-04-07 05:22:31 NaN NaN a
2017-04-07 05:22:32 NaN NaN a
2017-04-07 05:22:33 NaN NaN a
2017-04-07 05:22:34 NaN NaN a
2017-04-07 05:22:35 NaN NaN a
2017-04-07 05:22:36 NaN NaN a
2017-04-07 05:22:37 NaN NaN a
2017-04-07 05:22:38 NaN NaN a
2017-04-07 05:22:39 NaN NaN a
2017-04-07 05:22:40 NaN NaN a
2017-04-07 05:22:41 NaN NaN a
2017-04-07 05:22:42 NaN NaN a
2017-04-07 05:22:43 NaN NaN a
2017-04-07 05:22:44 NaN NaN a
2017-04-07 05:22:45 NaN NaN a
2017-04-07 05:22:46 NaN NaN a
2017-04-07 05:22:47 NaN NaN a
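Note:
This code is only one part of a larger project.
After this transformation I also use get_dummies() to get a separate column for each value of each column, so feel free to take that into account in your approach.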
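For context, the get_dummies() step mentioned above might look roughly like the following. This is only a sketch on a hypothetical four-row miniature of the expanded frame (the name `expanded` and its contents are made up for illustration), not the project's real code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the expanded frame: one column per group.
idx = pd.date_range("2017-04-03 05:22:21", periods=4, freq="1s")
expanded = pd.DataFrame(
    {"1": ["a", "a", np.nan, np.nan], "2": [np.nan, "b", "b", np.nan]},
    index=idx,
)

# One indicator column per (group, value) pair, named "<group>_<value>".
# Rows where a group is NaN get no indicator set for that group.
dummies = pd.get_dummies(expanded)
print(dummies)
```

The indicator dtype depends on the pandas version (boolean in recent releases, integer in older ones).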
Thanks!
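Answer 0 (score: 0)
I would use merge_ordered to build, for each group, a dataframe indexed by the index of your data dataframe. It will contain unwanted values, so they have to be cleaned out. From there, building the final dataframe is easy: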
for g, dg in df.groupby('group'):
    # build a dataframe per group with the final index
    dy = pd.merge_ordered(data.rename_axis('dat').reset_index(), dg,
                          left_on='dat', right_on='start', fill_method='ffill')
    # clean values outside of [start:end] range
    dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'group'] = np.nan
    dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'value'] = np.nan
    # and use that to set the column in the final dataframe
    data[g] = dy.set_index('dat').value
If performance really matters, using the index properly makes a difference. This version should be about 3 times faster:
for g, dg in df.groupby('group'):
    # build a dataframe per group with the final index
    # (merge_asof needs both sides sorted by the join key, hence sort_index)
    dy = pd.merge_asof(data, dg.set_index('start').sort_index(),
                       left_index=True, right_index=True)
    # clean values outside of [start:end] range
    dy.loc[dy.index>dy.end,'value'] = np.nan
    # and use that to set the column in the final dataframe
    data[g] = dy.value
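To see what the as-of join does before the cleanup step, here is a minimal sketch on a hypothetical seven-second grid with two made-up intervals. It uses ordinary columns (`time`, `start`) as join keys instead of the index, purely to keep the illustration easy to read; the names `grid`, `intervals` and `result` are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical per-second grid covering 7 seconds.
times = pd.date_range("2017-04-03 05:22:21", periods=7, freq="1s")
grid = pd.DataFrame({"time": times})

# Hypothetical intervals for one group:
# seconds 0-2 hold 'a', seconds 5-6 hold 'b', seconds 3-4 are uncovered.
intervals = pd.DataFrame({
    "start": times[[0, 5]],
    "end": times[[2, 6]],
    "value": ["a", "b"],
})

# Backward as-of join: each second picks up the latest interval whose
# start is <= that second. Both frames must be sorted by their join key.
joined = pd.merge_asof(grid, intervals.sort_values("start"),
                       left_on="time", right_on="start")

# Blank out seconds that fall after the matched interval has ended.
joined.loc[joined["time"] > joined["end"], "value"] = np.nan

result = joined.set_index("time")["value"]
print(result)
```

Without the masking step, seconds 3 and 4 would still carry 'a', because the as-of join only looks at `start`.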
Answer 1 (score: 0)
First of all, you should really convert the values to a dtype other than object, i.e. use 0, 1, 2 instead of 'a', 'b', 'c'.
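One possible way to do that conversion is pd.factorize, sketched here on the value column from the question. The variable names are illustrative, and which integer each letter receives simply follows order of first appearance.

```python
import pandas as pd

values = pd.Series(["a", "b", "c", "b", "c", "a"], name="value")

# factorize assigns integer codes in order of first appearance
# and returns the distinct values so the codes can be mapped back.
codes, uniques = pd.factorize(values)
print(codes)
print(list(uniques))

# Round trip: recover the original labels from the codes.
restored = uniques.take(codes)
```

Keeping `uniques` around makes it possible to translate the integer columns back to labels after the expansion.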
As for the transformation code, the following seems to be very fast, at least on your example df. It is also short and readable.
data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq="1s"))
for i, row in df.iterrows():
    # write the row's value into its group's column for every second in [start, end]
    data.loc[(data.index >= row['start']) & (data.index <= row['end']),
             row['group']] = row['value']
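A possible variant of the loop above, shown only as a sketch: because the index is a sorted DatetimeIndex, a label slice `start:end` selects the same rows as the boolean mask (both endpoints are included). The two-row frame below is hypothetical, and pre-creating the group columns is an assumption of this sketch rather than part of the answer.

```python
import datetime as dt
import pandas as pd

# Hypothetical two-interval input with the same columns as the question.
df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3, 5, 22, 21), dt.datetime(2017, 4, 3, 5, 22, 25)],
    "end": [dt.datetime(2017, 4, 3, 5, 22, 23), dt.datetime(2017, 4, 3, 5, 22, 26)],
    "group": ["1", "2"],
    "value": ["a", "b"],
})

# Per-second index with one (initially all-NaN) column per group.
data = pd.DataFrame(
    index=pd.date_range(df["start"].min(), df["end"].max(), freq="1s"),
    columns=sorted(df["group"].unique()),
    dtype=object,
)

# Label slicing on a DatetimeIndex includes both start and end.
for row in df.itertuples(index=False):
    data.loc[row.start:row.end, row.group] = row.value

print(data)
```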