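I have a pandas.DataFrame containing start and end columns, plus a few other columns. I want to expand this DataFrame into a time series that begins at the start value and ends at the end value, while copying the other columns. So far I have come up with the following: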
import pandas as pd
import datetime as dt
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
data_series = list()
for row in df.itertuples():
    # Business days between start and end, inclusive
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))
columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)
Starting DataFrame:
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-05 2017-04-12 EU b
2 2017-04-10 2017-04-17 UK c
Desired output:
date start end country letter
0 2017-04-03 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-03 2017-04-10 US a
2 2017-04-05 2017-04-03 2017-04-10 US a
3 2017-04-06 2017-04-03 2017-04-10 US a
4 2017-04-07 2017-04-03 2017-04-10 US a
5 2017-04-10 2017-04-03 2017-04-10 US a
6 2017-04-05 2017-04-05 2017-04-12 EU b
7 2017-04-06 2017-04-05 2017-04-12 EU b
8 2017-04-07 2017-04-05 2017-04-12 EU b
9 2017-04-10 2017-04-05 2017-04-12 EU b
10 2017-04-11 2017-04-05 2017-04-12 EU b
11 2017-04-12 2017-04-05 2017-04-12 EU b
12 2017-04-10 2017-04-10 2017-04-17 UK c
13 2017-04-11 2017-04-10 2017-04-17 UK c
14 2017-04-12 2017-04-10 2017-04-17 UK c
15 2017-04-13 2017-04-10 2017-04-17 UK c
16 2017-04-14 2017-04-10 2017-04-17 UK c
17 2017-04-17 2017-04-10 2017-04-17 UK c
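The problem with my solution is that when it is applied to a much larger DataFrame (larger mostly in row count), it does not produce the result fast enough for me. Does anyone have ideas on how I could improve it? I am also open to numpy-based solutions.
Answer 0 (score: 1)
For your DataFrame: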
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-05 2017-04-12 EU b
2 2017-04-10 2017-04-17 UK c
First, set up a new index with pd.date_range or DatetimeIndex(), bounded by the first and last values:
index = pd.date_range(df['start'].iloc[0], df['end'].iloc[-1])
# or, on pandas < 1.0 only (the start/end constructor arguments were removed in pandas 1.0)
index = pd.DatetimeIndex(start=df['start'].iloc[0], end=df['end'].iloc[-1], freq='D')
>>
index = pd.DatetimeIndex(['2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06',
'2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10',
'2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14',
'2017-04-15', '2017-04-16', '2017-04-17'],
dtype='datetime64[ns]', freq='D')
Then reindex() with method='ffill', followed by reset_index and rename:
df2 = df.set_index(['start']).reindex(index, method='ffill')
df2['Date'] = df2.index
df2.reset_index().rename(columns={'index':'start'})
start end country letter Date
0 2017-04-03 2017-04-10 US a 2017-04-03
1 2017-04-04 2017-04-10 US a 2017-04-04
2 2017-04-05 2017-04-12 EU b 2017-04-05
3 2017-04-06 2017-04-12 EU b 2017-04-06
4 2017-04-07 2017-04-12 EU b 2017-04-07
5 2017-04-08 2017-04-12 EU b 2017-04-08
6 2017-04-09 2017-04-12 EU b 2017-04-09
7 2017-04-10 2017-04-17 UK c 2017-04-10
8 2017-04-11 2017-04-17 UK c 2017-04-11
9 2017-04-12 2017-04-17 UK c 2017-04-12
10 2017-04-13 2017-04-17 UK c 2017-04-13
11 2017-04-14 2017-04-17 UK c 2017-04-14
12 2017-04-15 2017-04-17 UK c 2017-04-15
13 2017-04-16 2017-04-17 UK c 2017-04-16
14 2017-04-17 2017-04-17 UK c 2017-04-17
# Time:
0.009 s
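For reference, here is a minimal self-contained sketch of the reindex approach above, assuming a recent pandas where pd.date_range is the supported way to build the index. The names calendar and expanded are illustrative only. Note that forward-filling yields one row per calendar day between the first start and the last end, so weekends are kept and overlapping ranges are not duplicated, which differs from the desired output in the question.

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

# Daily calendar from the earliest start to the latest end (illustrative name).
calendar = pd.date_range(df['start'].min(), df['end'].max(), freq='D')

# Forward-fill each calendar day from the most recent start date.
# This requires the start column to be sorted in increasing order.
expanded = (
    df.set_index('start')
      .reindex(calendar, method='ffill')
      .rename_axis('date')
      .reset_index()
)
print(expanded)
```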
Old answer:
Index the frame on the start column and expand with asfreq():
df2 = df.set_index(['start']).asfreq('D').ffill().reset_index()
>>>
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-10 US a
2 2017-04-05 2017-04-12 EU b
3 2017-04-06 2017-04-12 EU b
4 2017-04-07 2017-04-12 EU b
5 2017-04-08 2017-04-12 EU b
6 2017-04-09 2017-04-12 EU b
7 2017-04-10 2017-04-17 UK c
asfreq documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.asfreq.html
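As a small aside, asfreq() also accepts a method argument, so the fill can be requested in the same call. The sketch below assumes the question's sample data; the name daily is illustrative. Like the old answer, it only extends to the last start value, not to the last end.

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

# Upsample to daily frequency and forward-fill the new rows in one step.
daily = df.set_index('start').asfreq('D', method='ffill').reset_index()
print(daily)
```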
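Answer 1 (score: 0)
Since the goal here is speed, we should vectorize every step. A single for loop can slow the code down by orders of magnitude. A vectorized solution is provided below: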
import numpy as np

cols = list(df.columns)
# Give every source row an integer id so the expanded rows can be joined back.
df['data_id'] = np.arange(0, len(df))
data_id = df['data_id']
start = df['start']
end = df['end']
# Number of calendar days in each [start, end] range, inclusive.
diff = ((end-start) / np.timedelta64(1, 'D')).astype('int') + 1
# Repeat each id once per calendar day in its range.
repeated_id = np.repeat(data_id, diff)
time_df = pd.DataFrame(data={'data_id': repeated_id})
time_df = pd.merge(left=time_df, right=df[['data_id', 'start']], on=['data_id'])
# day_id is a running counter; subtracting each group's minimum gives the day offset within the range.
time_df['day_id'] = np.arange(0, len(time_df))
min_day_id = time_df.groupby('data_id')['day_id'].min().reset_index().rename(columns={'day_id': 'min_day_id'})
time_df = pd.merge(left=time_df, right=min_day_id, on=['data_id'])
days_to_add = (time_df['day_id'] - time_df['min_day_id']) * np.timedelta64(1, 'D')
time_df['date'] = time_df['start'] + days_to_add
# Keep business days only (Monday=0 ... Friday=4).
time_df = time_df[time_df['date'].dt.dayofweek < 5]
# Attach the remaining columns and put date first.
df = pd.merge(left=df, right=time_df[['data_id', 'date']], on=['data_id'])
df = df[['date']+cols]
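The vectorized version works as follows:
Using jezrael's timing setup for comparison, the 'original method' took 1.15 s on my machine, while the vectorized version took 56.9 ms, a 20x speedup.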
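The same idea (repeat each row once per calendar day, add a per-row day offset to start, then drop weekends) can be condensed without the intermediate merges. This is only a sketch under the assumption that every end is on or after its start; expand_business_days is a hypothetical helper name, not part of pandas, and it has not been timed against the versions above.

```python
import datetime as dt

import numpy as np
import pandas as pd


def expand_business_days(df):
    """Hypothetical helper: one output row per business day in [start, end]."""
    # Inclusive number of calendar days covered by each source row.
    n_days = (df['end'] - df['start']).dt.days.to_numpy() + 1
    # Repeat every source row once per calendar day in its range.
    out = df.loc[df.index.repeat(n_days)].reset_index(drop=True)
    # Offset of each repeated row within its own range: 0, 1, ..., n_days - 1.
    first_pos = np.cumsum(n_days) - n_days
    offsets = np.arange(len(out)) - np.repeat(first_pos, n_days)
    out.insert(0, 'date', out['start'] + pd.to_timedelta(offsets, unit='D'))
    # Drop Saturdays and Sundays (dayofweek: Monday=0 ... Sunday=6).
    return out[out['date'].dt.dayofweek < 5].reset_index(drop=True)


df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

out = expand_business_days(df)
print(out)
```

On the sample data this should reproduce the business-day rows that the loop with pd.bdate_range builds in the question.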