Expand a DataFrame based on start and end columns (speed)

Time: 2017-05-07 14:09:26

Tags: python pandas numpy

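I have a pandas.DataFrame that contains start and end columns, plus a few other columns. I want to expand this DataFrame into a time series that begins at the start value and ends at the end value, while replicating the other columns. So far I have come up with the following: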

import pandas as pd
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))

columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)

Starting DataFrame:

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

Desired output:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c

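The problem with my solution is that when I apply it to a much larger DataFrame (larger mostly in the number of rows), it does not produce the result fast enough for me. Does anyone have ideas on how I could improve it? I am also open to NumPy-based solutions.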

2 answers:

Answer 0 (score: 1)

For your DataFrame:

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

First, build a new index with pd.date_range or DatetimeIndex(), bounded by the first start value and the last end value:

index = pd.date_range(df['start'].iloc[0], df['end'].iloc[-1])
# or, on pandas older than 1.0 (the start/end arguments were later removed):
index = pd.DatetimeIndex(start=df['start'].iloc[0], end=df['end'].iloc[-1], freq='D')

>>
index = pd.DatetimeIndex(['2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06',
               '2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10',
               '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14',
               '2017-04-15', '2017-04-16', '2017-04-17'],
              dtype='datetime64[ns]', freq='D')

Then use reindex() with method='ffill', followed by reset_index and rename:

df2 = df.set_index(['start']).reindex(index, method='ffill')
df2['Date'] = df2.index
df2.reset_index().rename(columns={'index':'start'})

        start        end country letter       Date
0  2017-04-03 2017-04-10      US      a 2017-04-03
1  2017-04-04 2017-04-10      US      a 2017-04-04
2  2017-04-05 2017-04-12      EU      b 2017-04-05
3  2017-04-06 2017-04-12      EU      b 2017-04-06
4  2017-04-07 2017-04-12      EU      b 2017-04-07
5  2017-04-08 2017-04-12      EU      b 2017-04-08
6  2017-04-09 2017-04-12      EU      b 2017-04-09
7  2017-04-10 2017-04-17      UK      c 2017-04-10
8  2017-04-11 2017-04-17      UK      c 2017-04-11
9  2017-04-12 2017-04-17      UK      c 2017-04-12
10 2017-04-13 2017-04-17      UK      c 2017-04-13
11 2017-04-14 2017-04-17      UK      c 2017-04-14
12 2017-04-15 2017-04-17      UK      c 2017-04-15
13 2017-04-16 2017-04-17      UK      c 2017-04-16
14 2017-04-17 2017-04-17      UK      c 2017-04-17

# Time:
0.009 s
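
To make the forward-fill step concrete, here is a minimal sketch on a toy frame. The values and the names toy, target and filled are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Toy frame (hypothetical values) indexed by a sorted 'start' column.
toy = pd.DataFrame(
    {
        "start": pd.to_datetime(["2017-04-03", "2017-04-05"]),
        "letter": ["a", "b"],
    }
).set_index("start")

# Daily target index covering the dates we want rows for.
target = pd.date_range("2017-04-03", "2017-04-06", freq="D")

# method='ffill' fills every new date from the last start on or before it.
filled = toy.reindex(target, method="ffill")
print(filled)
```

Each target date is filled from exactly one source row, so rows whose ranges overlap share dates instead of each getting its own full range.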

Old answer:

Set the index to the start column and expand with asfreq():

df2 = df.set_index(['start']).asfreq('D').ffill().reset_index()
>>>
       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-04 2017-04-10      US      a
2 2017-04-05 2017-04-12      EU      b
3 2017-04-06 2017-04-12      EU      b
4 2017-04-07 2017-04-12      EU      b
5 2017-04-08 2017-04-12      EU      b
6 2017-04-09 2017-04-12      EU      b
7 2017-04-10 2017-04-17      UK      c

asfreq documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.asfreq.html

Answer 1 (score: 0)

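Since the goal here is speed, we should vectorize every step. A single for loop can slow the code down by several orders of magnitude. A vectorized solution is provided below:

import numpy as np

cols = list(df.columns)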

df['data_id'] = np.arange(0, len(df))

data_id = df['data_id']
start = df['start']
end = df['end']

diff = ((end-start) / np.timedelta64(1, 'D')).astype('int') + 1
repeated_id = np.repeat(data_id, diff)
time_df = pd.DataFrame(data={'data_id': repeated_id})
time_df = pd.merge(left=time_df, right=df[['data_id', 'start']], on=['data_id'])
time_df['day_id'] = np.arange(0, len(time_df))

min_day_id = time_df.groupby('data_id')['day_id'].min().reset_index().rename(columns={'day_id': 'min_day_id'})
time_df = pd.merge(left=time_df, right=min_day_id, on=['data_id'])
days_to_add = (time_df['day_id'] - time_df['min_day_id']) * np.timedelta64(1, 'D')
time_df['date'] = time_df['start'] + days_to_add

time_df = time_df[time_df['date'].dt.dayofweek < 5]

df = pd.merge(left=df, right=time_df[['data_id', 'date']], on=['data_id'])
df = df[['date']+cols]

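The vectorized version works as follows:

  • Add an 'id' column to each row of the original data frame (the data frame index would work as well)
  • Create a new data frame that expands each row to the number of days in its range
  • Enumerate each group of expanded rows with the integers 0..N
  • Add this number of days to the start date to get the actual date
  • Filter for business days
  • Merge the expanded data frame with the dates back onto the original data frame, keyed on the 'id' column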
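
The steps above can also be sketched more compactly. This is only one possible variant, not the code from the answer: it assumes the data frame index serves as the row id, and it swaps the merge/min bookkeeping for groupby(...).cumcount(). The names n_days, expanded, offset and dates are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2017-04-03", "2017-04-05", "2017-04-10"]),
    "end": pd.to_datetime(["2017-04-10", "2017-04-12", "2017-04-17"]),
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})

# Number of calendar days in each inclusive [start, end] range.
n_days = ((df["end"] - df["start"]).dt.days + 1).to_numpy()

# Repeat every row once per calendar day; the repeated index is the row id.
expanded = df.loc[df.index.repeat(n_days)]

# Enumerate the copies of each original row: 0, 1, ..., n_days - 1.
offset = expanded.groupby(level=0).cumcount().to_numpy()

# Add the offset (in days) to the start date to get the actual date.
dates = (expanded["start"] + pd.to_timedelta(offset, unit="D")).to_numpy()
expanded = expanded.assign(date=dates)[["date", *df.columns]]

# Keep Monday to Friday only and renumber the rows.
expanded = expanded[expanded["date"].dt.dayofweek < 5].reset_index(drop=True)
print(expanded)
```

Like the answer's code, this filters on weekdays only, so it does not account for holidays.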

Comparing with jezrael's timing setup, the 'original method' took 1.15 s on my machine, while the vectorized version took 56.9 ms, roughly a 20x speedup.
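
The benchmark data behind those numbers is not shown, so the following is only a rough harness for running a similar comparison yourself. It assumes a synthetic frame built by repeating the three sample rows. expand_loop mirrors the loop from the question, and expand_vectorized is a NumPy-based stand-in rather than the exact code above. Timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def expand_loop(df):
    """Row-by-row expansion with pd.bdate_range, as in the question."""
    rows = []
    for row in df.itertuples(index=False):
        for day in pd.bdate_range(row.start, row.end):
            rows.append((day, row.start, row.end, row.country, row.letter))
    columns = ["date", "start", "end", "country", "letter"]
    return pd.DataFrame(rows, columns=columns)


def expand_vectorized(df):
    """Repeat rows per calendar day, add day offsets, keep weekdays."""
    n_days = ((df["end"] - df["start"]).dt.days + 1).to_numpy()
    row_pos = np.repeat(np.arange(len(df)), n_days)
    # Position of the first copy of each row, repeated for every copy.
    first = np.repeat(np.cumsum(n_days) - n_days, n_days)
    offset = np.arange(len(row_pos)) - first  # 0, 1, ..., n_days - 1
    out = df.iloc[row_pos].reset_index(drop=True)
    out.insert(0, "date", out["start"] + pd.to_timedelta(offset, unit="D"))
    return out[out["date"].dt.dayofweek < 5].reset_index(drop=True)


sample = pd.DataFrame({
    "start": pd.to_datetime(["2017-04-03", "2017-04-05", "2017-04-10"]),
    "end": pd.to_datetime(["2017-04-10", "2017-04-12", "2017-04-17"]),
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})
big = pd.concat([sample] * 200, ignore_index=True)  # 600 rows (assumed size)

t0 = time.perf_counter()
slow = expand_loop(big)
t1 = time.perf_counter()
fast = expand_vectorized(big)
t2 = time.perf_counter()

# Both approaches should produce the same dates in the same order.
assert fast["date"].tolist() == slow["date"].tolist()
print(f"loop: {t1 - t0:.3f} s, vectorized: {t2 - t1:.3f} s")
```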