I currently have a dataset of about 1.2 million rows with date, hour, and interval (quarter of the hour) columns.
Delivery Date Delivery Hour Delivery Interval
1-1-2017 1 1
1-1-2017 1 2
1-1-2017 1 3
1-1-2017 1 4
I currently have a for loop that combines these columns into one, but I am looking for a faster way, because the loop takes several hours to finish.
for i in range(len(df_rt['Delivery Interval'])):
    hour = int(df_rt['Delivery Hour'][i]) - 1
    minute = (int(df_rt['Delivery Interval'][i]) - 1)*15
    df_rt['Time'][i] = str(hour) + ':' + str(minute)
df_rt['DateTime'] = df_rt['Delivery Date'] + " " + df_rt['Time']
df_rt['DateTime'] = pd.to_datetime(df_rt['DateTime'])
Answer 0 (score: 1)
This is not the cleanest solution, but it avoids an explicit loop over the DataFrame:
df['DateTime'] = pd.to_datetime(df['Delivery Date'].astype(str) + ' ' +
(df['Delivery Hour'].astype(int)-1).astype(str) + ':' +
((df['Delivery Interval'].astype(int)-1)*15).astype(str))
Given the sample DataFrame, the result is:
Delivery Date Delivery Hour Delivery Interval DateTime
0 1-1-2017 1 1 2017-01-01 00:00:00
1 1-1-2017 1 2 2017-01-01 00:15:00
2 1-1-2017 1 3 2017-01-01 00:30:00
3 1-1-2017 1 4 2017-01-01 00:45:00
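A self-contained sketch of the same string-concatenation idea, for anyone who wants to run it directly. Two things here are my assumptions and not part of the answer above: the dates are read as day-month-year (the sample `1-1-2017` is ambiguous), and the hour and minute are zero-padded so an explicit `format` can be passed to `pd.to_datetime` and it does not have to guess.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Delivery Date": ["1-1-2017"] * 4,
    "Delivery Hour": [1, 1, 1, 1],
    "Delivery Interval": [1, 2, 3, 4],
})

# Hours are 1-based and intervals are 1-based quarter hours, so shift both to 0-based.
hour = (df["Delivery Hour"].astype(int) - 1).astype(str).str.zfill(2)
minute = ((df["Delivery Interval"].astype(int) - 1) * 15).astype(str).str.zfill(2)

# Assumption: day-month-year order. Swap %d and %m if the data is month-first.
text = df["Delivery Date"].astype(str) + " " + hour + ":" + minute
df["DateTime"] = pd.to_datetime(text, format="%d-%m-%Y %H:%M")

print(df)
```

Passing `format` explicitly also tends to help on large frames, since the parser does not need to infer the layout.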
Answer 1 (score: 0)
The following should work (method 1):
df_rt['datetime'] = pd.to_datetime(df_rt['Delivery Date']) \
+ pd.to_timedelta(df_rt['Delivery Hour'] - 1, unit='h') \
+ pd.to_timedelta(15*(df_rt['Delivery Interval'] - 1), unit='m')
An approach that timed faster, at least on the small sample (method 2):
def format_row(row):
    return f'{row["Delivery Date"]} {row["Delivery Hour"] - 1}:{15*(row["Delivery Interval"] - 1)}'
pd.to_datetime(df_rt.apply(format_row, axis='columns'), format='%d-%m-%Y %H:%M')
Timings:
Method 1
2.53 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Method 2
1.21 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
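Millisecond-scale timings like these suggest a very small frame, and the ranking may not hold at 1.2 million rows: `apply` with `axis='columns'` calls a Python function once per row, while the timedelta version works on whole columns at once. Below is a rough sketch for re-measuring both on a larger synthetic frame. The frame `big`, the helper names `method_1` and `method_2`, and the day-month-year date layout are my own assumptions for illustration; raise `n_days` to get closer to the real size.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 200 days * 24 hours * 4 intervals = 19,200 rows.
n_days = 200
dates = pd.date_range("2017-01-01", periods=n_days, freq="D").strftime("%d-%m-%Y")
big = pd.DataFrame({
    "Delivery Date": np.repeat(dates.to_numpy(), 96),
    "Delivery Hour": np.tile(np.repeat(np.arange(1, 25), 4), n_days),
    "Delivery Interval": np.tile(np.arange(1, 5), 24 * n_days),
})


def method_1(df):
    # Vectorized: parse the date once, then add hour and minute offsets.
    return (pd.to_datetime(df["Delivery Date"], format="%d-%m-%Y")
            + pd.to_timedelta(df["Delivery Hour"] - 1, unit="h")
            + pd.to_timedelta(15 * (df["Delivery Interval"] - 1), unit="min"))


def format_row(row):
    return f'{row["Delivery Date"]} {row["Delivery Hour"] - 1}:{15 * (row["Delivery Interval"] - 1)}'


def method_2(df):
    # Row-wise: build one string per row in Python, then parse with a fixed format.
    return pd.to_datetime(df.apply(format_row, axis="columns"), format="%d-%m-%Y %H:%M")


for name, func in [("method 1", method_1), ("method 2", method_2)]:
    start = time.perf_counter()
    result = func(big)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f} s for {len(big)} rows")
```

A single `perf_counter` run is noisy; `timeit` or IPython's `%timeit` would give steadier numbers.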
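Answer 2 (score: 0)
An interesting approach is to convert the series into the components of a datetime object, and then pass a DataFrame of those components to pd.to_datetime: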
df[['month', 'day', 'year']] = df['DeliveryDate'].str.split('-', expand=True)
df['DeliveryHour'] -= 1
df['DeliveryInterval'] = (df['DeliveryInterval'] - 1) * 15
df = df.rename(columns={'DeliveryHour': 'hour', 'DeliveryInterval': 'minute'})
print(pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']]))
0 2017-01-01 00:00:00
1 2017-01-01 00:15:00
2 2017-01-01 00:30:00
3 2017-01-01 00:45:00
dtype: datetime64[ns]
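The snippet above uses column names without spaces and overwrites the hour and interval columns in place. Here is a sketch of the same component idea that uses the question's column names and leaves the original columns untouched. The `parts` frame and the `date_order` list are names I made up for illustration, and the month-day-year order simply follows the split above; the sample date `1-1-2017` cannot confirm it.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Delivery Date": ["1-1-2017"] * 4,
    "Delivery Hour": [1, 1, 1, 1],
    "Delivery Interval": [1, 2, 3, 4],
})

# Assumption: month-day-year, as in the split above. Reorder if the data is day-first.
date_order = ["month", "day", "year"]

# Split the date string into integer components in a separate frame.
parts = df["Delivery Date"].str.split("-", expand=True).astype(int)
parts.columns = date_order

# Shift the 1-based hour and quarter-hour interval to 0-based hour and minute.
parts["hour"] = df["Delivery Hour"] - 1
parts["minute"] = (df["Delivery Interval"] - 1) * 15

# pd.to_datetime assembles timestamps from columns named year, month, day, hour, minute.
df["DateTime"] = pd.to_datetime(parts)

print(df)
```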