I have a DataFrame with datetime intervals, like this:
   id          start_date            end_date
1   1 2016-10-01 00:00:00 2016-10-01 03:00:00
2   1 2016-10-03 05:30:00 2016-10-03 06:30:00
3   2 2016-10-03 23:30:00 2016-10-04 01:00:00  # This row should be split
4   1 2016-10-04 05:00:00 2016-10-04 06:00:00
5   2 2016-10-04 05:50:00 2016-10-04 06:00:00
6   1 2016-10-05 18:50:00 2016-10-06 02:00:00  # This one too
....
I want to "split" the intervals that cross midnight, so that every row starts and ends on the same day:
   id          start_date            end_date
1   1 2016-10-01 00:00:00 2016-10-01 03:00:00
2   1 2016-10-03 05:30:00 2016-10-03 06:30:00
3   2 2016-10-03 23:30:00 2016-10-03 23:59:59  # Split
4   2 2016-10-04 00:00:00 2016-10-04 01:00:00  # Split
5   1 2016-10-04 05:00:00 2016-10-04 06:00:00
6   2 2016-10-04 05:50:00 2016-10-04 06:00:00
7   1 2016-10-05 18:50:00 2016-10-05 23:59:59  # Split
8   1 2016-10-06 00:00:00 2016-10-06 02:00:00  # Split
....
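For anyone who wants to reproduce this, the input frame above can be built roughly as follows. This is only a sketch: it assumes both date columns are already datetime64, it uses the default 0-based index rather than the 1-based row labels shown above, and the name crosses_midnight is just an illustrative variable.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'id': [1, 1, 2, 1, 2, 1],
    'start_date': pd.to_datetime([
        '2016-10-01 00:00:00', '2016-10-03 05:30:00', '2016-10-03 23:30:00',
        '2016-10-04 05:00:00', '2016-10-04 05:50:00', '2016-10-05 18:50:00',
    ]),
    'end_date': pd.to_datetime([
        '2016-10-01 03:00:00', '2016-10-03 06:30:00', '2016-10-04 01:00:00',
        '2016-10-04 06:00:00', '2016-10-04 06:00:00', '2016-10-06 02:00:00',
    ]),
})

# Rows whose start and end fall on different calendar days are the ones to split.
crosses_midnight = df['start_date'].dt.normalize() != df['end_date'].dt.normalize()
print(df[crosses_midnight])
```

Only the third and sixth rows are flagged, matching the two rows marked in the table.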
Answer 0 (score: 2)
You can use the .dt accessor to build a boolean index of the rows that need updating, then adjust them accordingly:
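import pandas as pd

# Get the rows to split (start and end fall on different calendar days).
split_rows = (df['start_date'].dt.date != df['end_date'].dt.date)
# Get the new rows to append, moving the start_date to midnight of the next day.
new_rows = df[split_rows].copy()
new_rows['start_date'] = new_rows['start_date'].dt.normalize() + pd.Timedelta(days=1)
# Set the end_date of the existing rows to 23:59:59 on their start day.
df.loc[split_rows, 'end_date'] = (df.loc[split_rows, 'start_date'].dt.normalize()
                                  + pd.Timedelta(days=1) - pd.Timedelta(seconds=1))
# Append the new rows to the existing dataframe (DataFrame.append was removed in pandas 2.0).
# The stable sort keeps each original row ahead of its continuation row.
df = pd.concat([df, new_rows]).sort_index(kind='mergesort').reset_index(drop=True)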
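The procedure above assumes that start_date and end_date are at most one calendar day apart. If an interval can span several days, you can wrap the procedure in a while loop: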
# Get the rows to split.
split_rows = (df['start_date'].dt.date != df['end_date'].dt.date)
while split_rows.any():
    # Get the new rows, moving the start_date to midnight of the next day.
    new_rows = df[split_rows].copy()
    new_rows['start_date'] = new_rows['start_date'].dt.normalize() + pd.Timedelta(days=1)
    # Set the end_date of the existing rows to 23:59:59 on their start day.
    df.loc[split_rows, 'end_date'] = (df.loc[split_rows, 'start_date'].dt.normalize()
                                      + pd.Timedelta(days=1) - pd.Timedelta(seconds=1))
    # Append the new rows to the existing dataframe (stable sort keeps pieces in order).
    df = pd.concat([df, new_rows]).sort_index(kind='mergesort').reset_index(drop=True)
    # Get new rows to split (if the start_date to end_date span is more than 1 day).
    split_rows = (df['start_date'].dt.date != df['end_date'].dt.date)
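The sample data never crosses more than one midnight, so here is a minimal sketch of the loop applied to an interval that spans three of them. The helper name split_at_midnight and the demo row are hypothetical, and the sketch assumes a unique index and datetime64 columns.

```python
import pandas as pd

def split_at_midnight(df):
    """Hypothetical helper: repeat the one-day split until no row crosses midnight."""
    df = df.copy()
    one_day = pd.Timedelta(days=1)
    one_second = pd.Timedelta(seconds=1)
    split_rows = df['start_date'].dt.normalize() != df['end_date'].dt.normalize()
    while split_rows.any():
        # Continuation rows start at midnight of the following day.
        new_rows = df[split_rows].copy()
        new_rows['start_date'] = new_rows['start_date'].dt.normalize() + one_day
        # The existing rows now end at 23:59:59 on their start day.
        df.loc[split_rows, 'end_date'] = (
            df.loc[split_rows, 'start_date'].dt.normalize() + one_day - one_second
        )
        # Stable sort so each row stays ahead of its continuation.
        df = (pd.concat([df, new_rows])
                .sort_index(kind='mergesort')
                .reset_index(drop=True))
        split_rows = df['start_date'].dt.normalize() != df['end_date'].dt.normalize()
    return df

# Made-up interval running from the evening of Oct 5 into the early hours of Oct 8.
demo = pd.DataFrame({
    'id': [7],
    'start_date': pd.to_datetime(['2016-10-05 18:50:00']),
    'end_date': pd.to_datetime(['2016-10-08 02:00:00']),
})
result = split_at_midnight(demo)
print(result)
```

Each pass peels off one day, so the loop runs once per midnight crossed by the longest interval.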
Resulting output for the sample data:
   id          start_date            end_date
0   1 2016-10-01 00:00:00 2016-10-01 03:00:00
1   1 2016-10-03 05:30:00 2016-10-03 06:30:00
2   2 2016-10-03 23:30:00 2016-10-03 23:59:59
3   2 2016-10-04 00:00:00 2016-10-04 01:00:00
4   1 2016-10-04 05:00:00 2016-10-04 06:00:00
5   2 2016-10-04 05:50:00 2016-10-04 06:00:00
6   1 2016-10-05 18:50:00 2016-10-05 23:59:59
7   1 2016-10-06 00:00:00 2016-10-06 02:00:00
Answer 1 (score: 1)
This works:
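import pandas as pd

def date_split(row):
    # One start per calendar day covered; the first piece keeps its original start time.
    starts = pd.Series(pd.date_range(row['start_date'].date(),
                                     periods=row['diff'] + 1, freq='D'))
    starts.iloc[0] = row['start_date']
    # Each piece ends one second before the next starts; the last keeps the original end.
    ends = starts[1:] - pd.to_timedelta(1, unit='s')
    ends.loc[len(ends) + 1] = row['end_date']
    ends.reset_index(drop=True, inplace=True)
    ret = pd.concat([starts, ends], axis=1, keys=['start_date', 'end_date'])
    ret['id'] = row['id']
    return ret

# Number of midnights crossed; normalized dates keep this correct across month boundaries.
df['diff'] = (df['end_date'].dt.normalize() - df['start_date'].dt.normalize()).dt.days
req = pd.concat([df[df['diff'] == 0]] +
                [date_split(row) for _, row in df[df['diff'] > 0].iterrows()])
req = req.drop('diff', axis=1).reset_index(drop=True)
req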
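Note that this is a general approach and handles any number of days between start_date and end_date. Only the index positions of your rows will differ: the rows that needed no split come first, followed by the split ones.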
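One caveat about the diff column: it has to count the calendar days between the two dates. Subtracting day-of-month values (.dt.day) does not do that when an interval crosses a month boundary, whereas subtracting the normalized dates does. A small sketch with a made-up interval over the end of October (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical interval that crosses from October into November.
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-10-31 23:30:00']),
    'end_date': pd.to_datetime(['2016-11-01 01:00:00']),
})

# Day-of-month subtraction: 1 - 31 gives -30, not the number of midnights crossed.
day_of_month_diff = df['end_date'].dt.day - df['start_date'].dt.day

# Difference of the dates truncated to midnight: exactly one calendar day.
calendar_day_diff = (df['end_date'].dt.normalize()
                     - df['start_date'].dt.normalize()).dt.days

print(day_of_month_diff.iloc[0], calendar_day_diff.iloc[0])
```

With a negative diff the row would match neither the diff == 0 nor the diff > 0 filter and would silently drop out of the result.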