A faster way to combine date, hour, and interval columns in a large dataset

Time: 2018-10-08 20:46:23

Tags: python pandas datetime for-loop series

I currently have a dataset of about 1.2 million rows that contains date, hour, and interval (quarter of an hour) columns.

Delivery Date      Delivery Hour Delivery Interval
1-1-2017           1             1
1-1-2017           1             2
1-1-2017           1             3
1-1-2017           1             4

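I currently have a for loop that combines these columns into one, but I am looking for a faster method, because the loop would take several hours to finish running.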

for i in range(len(df_rt['Delivery Interval'])):

    hour = int(df_rt['Delivery Hour'][i]) - 1
    minute = (int(df_rt['Delivery Interval'][i]) - 1)*15
    df_rt['Time'][i] = str(hour) + ':' + str(minute)

df_rt['DateTime'] = df_rt['Delivery Date'] + " " + df_rt['Time']
df_rt['DateTime'] = pd.to_datetime(df_rt['DateTime'])

3 Answers:

Answer 0 (score: 1)

This is not the cleanest solution, but it avoids an explicit loop over the DataFrame:

df['DateTime'] = pd.to_datetime(df['Delivery Date'].astype(str) + ' ' + 
    (df['Delivery Hour'].astype(int)-1).astype(str) + ':' + 
    ((df['Delivery Interval'].astype(int)-1)*15).astype(str))

Given the sample DataFrame, the result is:

  Delivery Date  Delivery Hour  Delivery Interval            DateTime
0      1-1-2017              1                  1 2017-01-01 00:00:00
1      1-1-2017              1                  2 2017-01-01 00:15:00
2      1-1-2017              1                  3 2017-01-01 00:30:00
3      1-1-2017              1                  4 2017-01-01 00:45:00
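
For reference, here is a self-contained sketch of the same idea on the four sample rows. The explicit `format` argument is an addition that is not part of the answer above. It assumes the dates are month-first (the sample `1-1-2017` is ambiguous), and it spares pandas from having to infer the format.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "Delivery Date": ["1-1-2017"] * 4,
    "Delivery Hour": [1, 1, 1, 1],
    "Delivery Interval": [1, 2, 3, 4],
})

# Build "date hour:minute" strings column-wise, without a Python loop.
hour = (df["Delivery Hour"].astype(int) - 1).astype(str)
minute = ((df["Delivery Interval"].astype(int) - 1) * 15).astype(str)
stamp = df["Delivery Date"].astype(str) + " " + hour + ":" + minute

# Assumption: month-first dates. Swap %m and %d if the data is day-first.
df["DateTime"] = pd.to_datetime(stamp, format="%m-%d-%Y %H:%M")
print(df["DateTime"])
```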

Answer 1 (score: 0)

The following approach should work (Method 1):

df_rt['datetime'] = pd.to_datetime(df_rt['Delivery Date']) \
                    + pd.to_timedelta(df_rt['Delivery Hour'] - 1, unit='h') \
                    + pd.to_timedelta(15*(df_rt['Delivery Interval'] - 1), unit='m')

An approach that was faster in the timings below (Method 2):

def format_row(row):
    return f'{row["Delivery Date"]} {row["Delivery Hour"] - 1}:{15*(row["Delivery Interval"] - 1)}'
pd.to_datetime(df_rt.apply(format_row, axis='columns'), format='%d-%m-%Y %H:%M')

Timings:

Method 1

2.53 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Method 2

1.21 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
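
Timings in the millisecond range presumably come from a small sample, and vectorized operations generally scale better than row-wise `apply`, so it may be worth re-measuring on a frame closer to the real size. The following is a rough sketch under stated assumptions: `make_frame` is a hypothetical helper that generates synthetic data, the row count is reduced to keep the row-wise run short, and both methods use the day-first format from Method 2 so that their results can be compared. No particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows):
    """Hypothetical helper: synthetic frame shaped like the question's data."""
    idx = np.arange(n_rows)
    return pd.DataFrame({
        "Delivery Date": ["1-1-2017"] * n_rows,
        "Delivery Hour": idx // 4 % 24 + 1,   # cycles through 1..24
        "Delivery Interval": idx % 4 + 1,     # cycles through 1..4
    })


def method_1(df):
    # Vectorized: parse the date once, then add hour and minute offsets.
    return (pd.to_datetime(df["Delivery Date"], format="%d-%m-%Y")
            + pd.to_timedelta(df["Delivery Hour"] - 1, unit="h")
            + pd.to_timedelta(15 * (df["Delivery Interval"] - 1), unit="min"))


def method_2(df):
    # Row-wise: format one string per row, then parse with an explicit format.
    def format_row(row):
        return (f'{row["Delivery Date"]} {row["Delivery Hour"] - 1}'
                f':{15 * (row["Delivery Interval"] - 1)}')
    return pd.to_datetime(df.apply(format_row, axis="columns"),
                          format="%d-%m-%Y %H:%M")


df = make_frame(20_000)  # assumed size; raise it to approach 1.2 million rows

# Both methods should agree before their speed is compared.
assert (method_1(df) == method_2(df)).all()

timings = {}
for name, func in [("method 1", method_1), ("method 2", method_2)]:
    # Best of three single runs, in seconds.
    timings[name] = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print(f"{name}: {timings[name]:.4f} s")
```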

Answer 2 (score: 0)

An interesting approach is to convert the series into the components of a datetime object and then pass a DataFrame of those components to pd.to_datetime:

# Column names follow the question ('Delivery Date', with a space).
df[['month', 'day', 'year']] = df['Delivery Date'].str.split('-', expand=True)
df['Delivery Hour'] -= 1                                    # hours 1..24 -> 0..23
df['Delivery Interval'] = (df['Delivery Interval'] - 1) * 15  # quarter -> minutes
df = df.rename(columns={'Delivery Hour': 'hour', 'Delivery Interval': 'minute'})

print(pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']]))

0   2017-01-01 00:00:00
1   2017-01-01 00:15:00
2   2017-01-01 00:30:00
3   2017-01-01 00:45:00
dtype: datetime64[ns]
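
A self-contained, non-mutating variant of this idea can be sketched on the sample rows. It assumes the column names from the question (with spaces) and month-first dates, and it builds a separate components frame instead of overwriting the hour and interval columns of `df`:

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "Delivery Date": ["1-1-2017"] * 4,
    "Delivery Hour": [1, 1, 1, 1],
    "Delivery Interval": [1, 2, 3, 4],
})

# Split "month-day-year" into three string columns, then convert to integers.
parts = df["Delivery Date"].str.split("-", expand=True).astype(int)
parts.columns = ["month", "day", "year"]  # assumption: month-first order

# pd.to_datetime can assemble timestamps from columns named
# year, month, day, hour, minute.
components = parts.assign(
    hour=df["Delivery Hour"] - 1,
    minute=(df["Delivery Interval"] - 1) * 15,
)
df["DateTime"] = pd.to_datetime(components)
print(df["DateTime"])
```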