How to efficiently expand date ranges in a pandas DataFrame by group

Asked: 2019-05-13 07:20:05

Tags: python pandas date pandas-groupby processing-efficiency

I have a large dataset with several groups, containing a start-date column, an end-date column, and a value column (each group can have multiple values). I want to expand it efficiently into a new DataFrame indexed by time at one-second resolution, with one column per group in which the values are stored.

The data looks like this:

import pandas as pd
import datetime as dt
import numpy as np

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3, 5, 22, 21), dt.datetime(2017, 4, 5, 3, 51, 22),
               dt.datetime(2017, 4, 4, 4, 23, 33), dt.datetime(2017, 4, 3, 7, 28, 45),
               dt.datetime(2017, 4, 6, 5, 22, 24), dt.datetime(2017, 4, 6, 5, 22, 56)]

df['end'] = [dt.datetime(2017, 4, 3, 6, 33, 23), dt.datetime(2017, 4, 5, 3, 52, 46),
             dt.datetime(2017, 4, 4, 4, 58, 12), dt.datetime(2017, 4, 4, 1, 23, 34),
             dt.datetime(2017, 4, 7, 5, 22, 24), dt.datetime(2017, 4, 7, 5, 22, 47)]
df['group'] = ['1', '2', '3', '1', '2', '3']
df['value'] = ['a', 'b', 'c', 'b', 'c', 'a']

                start                 end group value
0 2017-04-03 05:22:21 2017-04-03 06:33:23     1     a
1 2017-04-05 03:51:22 2017-04-05 03:52:46     2     b
2 2017-04-04 04:23:33 2017-04-04 04:58:12     3     c
3 2017-04-03 07:28:45 2017-04-04 01:23:34     1     b
4 2017-04-06 05:22:24 2017-04-07 05:22:24     2     c
5 2017-04-06 05:22:56 2017-04-07 05:22:47     3     a

I tried the following approach:

  1. Construct a new DataFrame whose index is a one-second range from the earliest start to the latest end.

  2. Group the original data by group id.

  3. Iterate over each group's rows, building a small DataFrame per row, indexed from that row's start date to its end date and holding that row's value.

  4. Concatenate the small DataFrames of the same group into one DataFrame.

  5. Left-join each group DataFrame (really just its value column, on the date index) onto the big DataFrame, adding it as a column.

Here is the code snippet:


def turn_deltas(row, col):
    # Build a one-column DataFrame covering [start, end] at one-second
    # resolution, with the group id as the column name.
    key = str(row['group'])
    df = pd.DataFrame(index=pd.date_range(row['start'], row['end'], freq='1s'))
    df[key] = row[col]
    return df

grouped = df.groupby('group')
data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq='1s'))
for name, group in grouped:
    # Stack the per-row DataFrames of this group into one frame ...
    for i, (_, row) in enumerate(group.iterrows()):
        if i == 0:
            df_2 = turn_deltas(row, 'value')
        else:
            df_2 = pd.concat([df_2, turn_deltas(row, 'value')], axis=0)
    # ... then left-join it onto the full index as a new column.
    data = data.merge(df_2, how='left', left_index=True, right_index=True)

print(data)

My code works, but it is (very) slow.

In the end I get this expanded DataFrame:

2017-04-03 05:22:21    a  NaN  NaN
2017-04-03 05:22:22    a  NaN  NaN
2017-04-03 05:22:23    a  NaN  NaN
2017-04-03 05:22:24    a  NaN  NaN
2017-04-03 05:22:25    a  NaN  NaN
2017-04-03 05:22:26    a  NaN  NaN
2017-04-03 05:22:27    a  NaN  NaN
2017-04-03 05:22:28    a  NaN  NaN
2017-04-03 05:22:29    a  NaN  NaN
2017-04-03 05:22:30    a  NaN  NaN
2017-04-03 05:22:31    a  NaN  NaN
2017-04-03 05:22:32    a  NaN  NaN
2017-04-03 05:22:33    a  NaN  NaN
2017-04-03 05:22:34    a  NaN  NaN
2017-04-03 05:22:35    a  NaN  NaN
2017-04-03 05:22:36    a  NaN  NaN
2017-04-03 05:22:37    a  NaN  NaN
2017-04-03 05:22:38    a  NaN  NaN
2017-04-03 05:22:39    a  NaN  NaN
2017-04-03 05:22:40    a  NaN  NaN
2017-04-03 05:22:41    a  NaN  NaN
2017-04-03 05:22:42    a  NaN  NaN
2017-04-03 05:22:43    a  NaN  NaN
2017-04-03 05:22:44    a  NaN  NaN
2017-04-03 05:22:45    a  NaN  NaN
2017-04-03 05:22:46    a  NaN  NaN
2017-04-03 05:22:47    a  NaN  NaN
2017-04-03 05:22:48    a  NaN  NaN
2017-04-03 05:22:49    a  NaN  NaN
2017-04-03 05:22:50    a  NaN  NaN
...                  ...  ...  ...
2017-04-07 05:22:18  NaN    c    a
2017-04-07 05:22:19  NaN    c    a
2017-04-07 05:22:20  NaN    c    a
2017-04-07 05:22:21  NaN    c    a
2017-04-07 05:22:22  NaN    c    a
2017-04-07 05:22:23  NaN    c    a
2017-04-07 05:22:24  NaN    c    a
2017-04-07 05:22:25  NaN  NaN    a
2017-04-07 05:22:26  NaN  NaN    a
2017-04-07 05:22:27  NaN  NaN    a
2017-04-07 05:22:28  NaN  NaN    a
2017-04-07 05:22:29  NaN  NaN    a
2017-04-07 05:22:30  NaN  NaN    a
2017-04-07 05:22:31  NaN  NaN    a
2017-04-07 05:22:32  NaN  NaN    a
2017-04-07 05:22:33  NaN  NaN    a
2017-04-07 05:22:34  NaN  NaN    a
2017-04-07 05:22:35  NaN  NaN    a
2017-04-07 05:22:36  NaN  NaN    a
2017-04-07 05:22:37  NaN  NaN    a
2017-04-07 05:22:38  NaN  NaN    a
2017-04-07 05:22:39  NaN  NaN    a
2017-04-07 05:22:40  NaN  NaN    a
2017-04-07 05:22:41  NaN  NaN    a
2017-04-07 05:22:42  NaN  NaN    a
2017-04-07 05:22:43  NaN  NaN    a
2017-04-07 05:22:44  NaN  NaN    a
2017-04-07 05:22:45  NaN  NaN    a
2017-04-07 05:22:46  NaN  NaN    a
2017-04-07 05:22:47  NaN  NaN    a

Note: this code is only part of a larger project. After this transformation I also use get_dummies() to get a separate column for each value of each column, so you may want to take that into account in your strategy as well.
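For reference, a minimal sketch of that get_dummies() step, on a tiny hypothetical frame shaped like the expanded output above (group ids as column names, NaN where a group has no value at that second):

```python
import pandas as pd

# hypothetical expanded frame: one column per group, NaN outside ranges
data = pd.DataFrame(
    {"1": ["a", "a", None], "2": [None, "b", "b"]},
    index=pd.date_range("2017-04-03 05:22:21", periods=3, freq="1s"),
)

# one indicator column per (group, value) pair; NaN rows become all zeros
dummies = pd.get_dummies(data, columns=["1", "2"])
print(list(dummies.columns))  # ['1_a', '2_b']
```

The resulting column names combine the group id and the value, e.g. `1_a`, which keeps the per-group structure after dummy encoding.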

Thanks!

2 answers:

Answer 0 (score: 0)

I would use merge_ordered to build one DataFrame per group, indexed by the index of your data DataFrame. It will contain unwanted values, so those have to be cleaned out; but from there it is easy to build the final DataFrame:

for g, dg in df.groupby('group'):
    # build a dataframe per group with the final index
    dy = pd.merge_ordered(data.rename_axis('dat').reset_index(), dg,
         left_on='dat', right_on='start', fill_method='ffill')
    # clean values outside of [start:end] range
    dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'group'] = np.nan
    dy.loc[(dy.start>dy.dat)|(dy.dat>dy.end), 'value'] = np.nan
    # and use that to set the column in the final dataframe
    data[g] = dy.set_index('dat').value

If performance really matters, making proper use of the index makes a difference. This version should be roughly 3x faster:

for g, dg in df.groupby('group'):
    # build a dataframe per group with the final index
    # (merge_asof requires both sides to be sorted on the join key)
    dy = pd.merge_asof(data, dg.set_index('start'),
                       left_index=True, right_index=True)
    # clean values past the end of each [start:end] range
    dy.loc[dy.index > dy.end, 'value'] = np.nan
    # and use that to set the column in the final dataframe
    data[g] = dy.value

Answer 1 (score: 0)

First of all, you should really convert the values to a dtype other than object, e.g. use 0, 1, 2 instead of 'a', 'b', 'c'.
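A minimal sketch of that encoding step, here using pandas' factorize (one possible choice, not the only one) to map the labels to integer codes:

```python
import pandas as pd

values = pd.Series(['a', 'b', 'c', 'b', 'c', 'a'])
# factorize returns integer codes plus the unique labels, so the
# mapping can be reversed later with uniques[codes]
codes, uniques = pd.factorize(values)
print(codes.tolist())   # [0, 1, 2, 1, 2, 0]
print(list(uniques))    # ['a', 'b', 'c']
```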

As for the transformation code, this seems very fast, at least on your example df. It is also short and readable:

# one row per second over the full time span, one column per group
data = pd.DataFrame(index=pd.date_range(df['start'].min(), df['end'].max(), freq='1s'))

for i, row in df.iterrows():
    # mask the seconds covered by this row and write its value
    # into the column named after its group
    data.loc[(data.index >= row['start']) & (data.index <= row['end']),
             row['group']] = row['value']
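An end-to-end run of this loop on a one-row toy frame (dates chosen here purely for illustration) shows the fill behavior:

```python
import datetime as dt
import pandas as pd

toy = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3, 5, 22, 21)],
    "end":   [dt.datetime(2017, 4, 3, 5, 22, 25)],
    "group": ["1"],
    "value": ["a"],
})

# one row per second, both endpoints included
out = pd.DataFrame(index=pd.date_range(toy["start"].min(), toy["end"].max(), freq="1s"))
for _, row in toy.iterrows():
    out.loc[(out.index >= row["start"]) & (out.index <= row["end"]),
            row["group"]] = row["value"]

print(out["1"].tolist())  # ['a', 'a', 'a', 'a', 'a']
```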