I have many data points, and each one has two columns: start_tm and end_dt. How can I split the time span between start_tm and end_dt into 5-minute intervals?
For example:
id  start_tm          end_dt
1   2019-01-01 10:00  2019-01-01 11:00
What I am looking for is:
id  start_tm          end_dt
1   2019-01-01 10:00  2019-01-01 10:05
1   2019-01-01 10:05  2019-01-01 10:10
1   2019-01-01 10:10  2019-01-01 10:15
1   2019-01-01 10:15  2019-01-01 10:20
and so forth.
Is there an out-of-the-box function for this?
If not, any help creating such a function would be great.
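Answer 0: (score: 1)
If you have two Python datetime objects that mark the ends of a time span, and you just want to divide that span into 5-minute intervals represented by datetime objects, you can do this: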
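import datetime

d1 = datetime.datetime(2019, 1, 1, 10, 0)
d2 = datetime.datetime(2019, 1, 1, 11, 0)
delta = datetime.timedelta(minutes=5)

# Collect every 5-minute boundary from d1 up to (but not including) d2.
times = []
while d1 < d2:
    times.append(d1)
    d1 += delta
# Close the last interval at d2, even if it is shorter than delta.
times.append(d2)

# Print each pair of consecutive boundaries as one interval.
for i in range(len(times) - 1):
    print("{} - {}".format(times[i], times[i+1]))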
Output:
2019-01-01 10:00:00 - 2019-01-01 10:05:00
2019-01-01 10:05:00 - 2019-01-01 10:10:00
2019-01-01 10:10:00 - 2019-01-01 10:15:00
2019-01-01 10:15:00 - 2019-01-01 10:20:00
2019-01-01 10:20:00 - 2019-01-01 10:25:00
2019-01-01 10:25:00 - 2019-01-01 10:30:00
2019-01-01 10:30:00 - 2019-01-01 10:35:00
2019-01-01 10:35:00 - 2019-01-01 10:40:00
2019-01-01 10:40:00 - 2019-01-01 10:45:00
2019-01-01 10:45:00 - 2019-01-01 10:50:00
2019-01-01 10:50:00 - 2019-01-01 10:55:00
2019-01-01 10:55:00 - 2019-01-01 11:00:00
This should also handle spans that are not an exact multiple of the delta: the final interval simply comes out shorter.
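To make the shorter-final-interval behaviour easier to see, here is a minimal sketch that wraps the same loop in a generator. The name split_interval and its step parameter are made up for this illustration and are not part of the datetime module.

```python
import datetime


def split_interval(start, end, step=datetime.timedelta(minutes=5)):
    """Yield (slice_start, slice_end) pairs covering start..end.

    Hypothetical helper for illustration; the last pair is shorter than
    `step` when the span is not an exact multiple of it.
    """
    if step <= datetime.timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        # Never run past the end of the span.
        nxt = min(current + step, end)
        yield current, nxt
        current = nxt


# A 12-minute span is not a multiple of 5 minutes,
# so the last slice is only 2 minutes long.
slices = list(split_interval(
    datetime.datetime(2019, 1, 1, 10, 0),
    datetime.datetime(2019, 1, 1, 10, 12),
))
for s, e in slices:
    print("{} - {}".format(s, e))
```

Unlike the loop above, this version does not modify the start value it is given, so the same start and end can be reused afterwards.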
Answer 1: (score: 1)
I'm not familiar with pyspark, but if you are using pandas, the following works (and pyspark is probably similar):
1: Create the data
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'id': [1, 2],
    'start_tm': pd.date_range('2019-01-01 00:00', periods=2, freq='D'),
    'end_dt': pd.date_range('2019-01-01 00:30', periods=2, freq='D')})
# pandas dataframe is similar to the data in pyspark
Output
id  start_tm    end_dt
1   2019-01-01  2019-01-01 00:30:00
2   2019-01-02  2019-01-02 00:30:00
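2: Split the intervals
period = np.timedelta64(5, 'm')  # 5 minutes
# Rows whose interval is still longer than one period.
idx = (data['end_dt'] - data['start_tm']) > period
while idx.any():
    # The remainder of each long row: same end, start moved forward by one period.
    new_data = data[idx].copy()
    new_data['start_tm'] = new_data['start_tm'] + period
    # Trim the original rows to exactly one period.
    data.loc[idx, 'end_dt'] = (data.loc[idx, 'start_tm'] + period).values
    # ignore_index avoids duplicate index labels building up across iterations.
    data = pd.concat([data, new_data], axis=0, ignore_index=True)
    idx = (data['end_dt'] - data['start_tm']) > period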
Output
id  start_tm             end_dt
1   2019-01-01 00:00:00  2019-01-01 00:05:00
2   2019-01-02 00:00:00  2019-01-02 00:05:00
1   2019-01-01 00:05:00  2019-01-01 00:10:00
2   2019-01-02 00:05:00  2019-01-02 00:10:00
1   2019-01-01 00:10:00  2019-01-01 00:15:00
2   2019-01-02 00:10:00  2019-01-02 00:15:00
1   2019-01-01 00:15:00  2019-01-01 00:20:00
2   2019-01-02 00:15:00  2019-01-02 00:20:00
1   2019-01-01 00:20:00  2019-01-01 00:25:00
2   2019-01-02 00:20:00  2019-01-02 00:25:00
1   2019-01-01 00:25:00  2019-01-01 00:30:00
2   2019-01-02 00:25:00  2019-01-02 00:30:00
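The result above interleaves the ids. If you would rather keep each id's rows together and in time order, one possible alternative is to walk each row once and emit its slices directly. This is only a sketch: the function name split_rows and its parameters are hypothetical, the column names are taken from the example data, and a plain Python loop like this will be slower than vectorized code on large frames.

```python
import pandas as pd


def split_rows(df, start_col="start_tm", end_col="end_dt", freq="5min"):
    """Return a new frame with one row per `freq`-long slice of each interval.

    Hypothetical helper for illustration. All other columns are copied
    unchanged; the last slice of a row may be shorter than `freq`.
    """
    step = pd.Timedelta(freq)
    rows = []
    for rec in df.to_dict("records"):
        start, end = rec[start_col], rec[end_col]
        while start < end:
            # Never run past the end of the original interval.
            nxt = min(start + step, end)
            rows.append({**rec, start_col: start, end_col: nxt})
            start = nxt
    return pd.DataFrame(rows, columns=df.columns)


example = pd.DataFrame({
    "id": [1, 2],
    "start_tm": pd.to_datetime(["2019-01-01 00:00", "2019-01-02 00:00"]),
    # The second interval is 12 minutes long, so its last slice is 2 minutes.
    "end_dt": pd.to_datetime(["2019-01-01 00:30", "2019-01-02 00:12"]),
})
result = split_rows(example)
print(result)
```

Rows whose start is not earlier than their end produce no output rows in this sketch; adjust that if you need to keep them.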