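I am working with a dataframe in which every entry (row) has a start time, a duration, and other attributes. From it I want to create a new dataframe in which each original entry is split into 15-minute intervals, with all other attributes kept unchanged. The number of rows that each original entry produces in the new dataframe depends on its actual duration.
At first I tried pd.resample, but it did not do quite what I expected. I then wrote a function using itertuples() that works well, but it took about half an hour on a dataframe of roughly 3,000 rows. Now I want to do the same for 2 million rows, so I am looking for other options.
Suppose I have the following dataframe: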
import pandas as pd
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
# Assign the column directly so that it really becomes datetime64
testdf['start'] = pd.to_datetime(testdf['start'])
print(testdf)
>>> testdf
                start  duration Attribute_A  id
0 2018-01-05 11:48:00        22         abc   1
1 2018-05-04 09:05:00         8         def   2
2 2018-08-09 07:15:00        35         hij   3
3 2018-09-27 15:00:00         2         klm   4
I would like my result to look like this:
>>> resultdf
                start  duration Attribute_A  id
0 2018-01-05 11:45:00        12         abc   1
1 2018-01-05 12:00:00        10         abc   1
2 2018-05-04 09:00:00         8         def   2
3 2018-08-09 07:15:00        15         hij   3
4 2018-08-09 07:30:00        15         hij   3
5 2018-08-09 07:45:00         5         hij   3
6 2018-09-27 15:00:00         2         klm   4
This is the function I built with itertuples; it produces the desired result (the one shown above):
def min15_divider(df,newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15 # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15: # if it shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else: # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)
                else: # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,'duration': final_dur, 'id':row.id}
                    newdf = newdf.append(to_append, ignore_index=True)
        else: # When it is not an exact multiple of 15 min
            new_min = orig_min - remains # convert to multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains # remaining minutes in the initial interval
            while cumu_dur < row.duration: # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15
                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)
            else: # when we reach the last interval or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration # original duration less than remaining minutes in first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15) # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf
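Is there another way to do this, without itertuples, that would save me some time?
Thanks.
PS. I apologize for anything in my post that may look a bit odd; this is my first time asking a question on stackoverflow.
Many entries can have the same start time, so a .groupby on 'start' could be problematic. However, every entry has a column with a unique value, simply called "id".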
Answer 0 (score: 0)
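Using pd.resample is a good idea, but since each row only has a start time, you need to build an end row before you can use it.
The code below assumes that every start time in the 'start' column is unique, so groupby can be used in a somewhat unusual way: each group will contain exactly one row.
I use groupby because it automatically reassembles the dataframes produced by the custom function that apply calls.
Also note that the 'duration' column is converted to a timedelta in minutes, which makes the arithmetic later on easier.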
import pandas as pd
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], unit='min')
print(testdf)
def calcduration(df, starttime):
    # Positional index of the 'duration' column, used to assign without chained indexing
    col = df.columns.get_loc('duration')
    if len(df) == 1:
        return
    elif len(df) == 2:
        df.iloc[0, col] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1, col] = df.iloc[1, col] - df.iloc[0, col]
    elif len(df) > 2:
        df.iloc[0, col] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1:-1, col] = pd.Timedelta(15, 'min')
        df.iloc[-1, col] = df.iloc[-1, col] - df.iloc[:-1, col].sum()
def expandtime(x):
    frow = x.copy()
    # Second row: same attributes, placed at the end time of the entry
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15min').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf
findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)
This code produces:
                      duration Attribute_A
  start
0 2018-01-05 11:45:00 00:12:00         abc
  2018-01-05 12:00:00 00:10:00         abc
1 2018-05-04 09:00:00 00:08:00         def
2 2018-08-09 07:15:00 00:15:00         hij
  2018-08-09 07:30:00 00:15:00         hij
  2018-08-09 07:45:00 00:05:00         hij
3 2018-09-27 15:00:00 00:02:00         klm
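expandtime is the first custom function. It takes a one-row dataframe (because we assume the values in 'start' are unique), builds a second row whose 'start' equals the first row's 'start' plus the duration, and then uses resample to sample it at 15-minute intervals. The values of all other columns are repeated.
calcduration does some arithmetic on the 'duration' column to compute the correct duration for each row.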
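To make the upsampling step easier to follow, here is a minimal sketch of what expandtime does for a single entry before calcduration is applied. The variable names (one, end_row, two, bins) are only illustrative and do not appear in the answer's code.

```python
import pandas as pd

# A single 35-minute entry starting at 07:15 (the 'hij' row of the example).
one = pd.DataFrame({
    'start': pd.to_datetime(['2018-08-09 07:15:00']),
    'duration': pd.to_timedelta([35], unit='min'),
    'Attribute_A': ['hij'],
})

# Build the artificial "end" row, located at start + duration (07:50).
end_row = one.copy()
end_row['start'] = end_row['start'] + end_row['duration']

# Two rows indexed by time: resample creates every 15-minute bin between
# them, and nearest() copies the attribute values into each bin.
two = pd.concat([one, end_row]).set_index('start')
bins = two.resample('15min').nearest()
print(bins)
```

Every bin still carries the full 35-minute duration at this point, which is why the separate calcduration pass is needed.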
Answer 1 (score: 0)
So, starting from your df:
import numpy as np
import pandas as pd
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
df = pd.DataFrame(testdict)
df['start'] = pd.to_datetime(df['start'])
print(df)
First, compute the end time of each row:
df['dur'] = pd.to_timedelta(df['duration'], unit='m')
df['end'] = df['start'] + df['dur']
Then create two new columns that hold the start and end times floored to the fixed interval (15 minutes):
df['start15'] = df['start'].dt.floor('15min')
df['end15'] = df['end'].dt.floor('15min')
At this point the dataframe looks like this:
  Attribute_A  duration               start      dur                 end             start15               end15
0         abc        22 2018-01-05 11:48:00 00:22:00 2018-01-05 12:10:00 2018-01-05 11:45:00 2018-01-05 12:00:00
1         def         8 2018-05-04 09:05:00 00:08:00 2018-05-04 09:13:00 2018-05-04 09:00:00 2018-05-04 09:00:00
2         hij        35 2018-08-09 07:15:00 00:35:00 2018-08-09 07:50:00 2018-08-09 07:15:00 2018-08-09 07:45:00
3         klm         2 2018-09-27 15:00:00 00:02:00 2018-09-27 15:02:00 2018-09-27 15:00:00 2018-09-27 15:00:00
The start15 and end15 columns hold the correctly binned times, but you need to merge them into a single column:
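# Recent pandas versions reject a value_name that matches an existing column, so melt into a temporary name first
df = pd.melt(df, id_vars=['dur', 'start', 'Attribute_A', 'end'], value_vars=['start15', 'end15'], value_name='bin15')
df = df.drop(columns='variable').rename(columns={'bin15': 'start15'})
df = df.drop_duplicates('start15').sort_values('start15').set_index('start15')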
Output:
                         dur               start Attribute_A
start15
2018-01-05 11:45:00 00:22:00 2018-01-05 11:48:00         abc
2018-01-05 12:00:00 00:22:00 2018-01-05 11:48:00         abc
2018-05-04 09:00:00 00:08:00 2018-05-04 09:05:00         def
2018-08-09 07:15:00 00:35:00 2018-08-09 07:15:00         hij
2018-08-09 07:45:00 00:35:00 2018-08-09 07:15:00         hij
2018-09-27 15:00:00 00:02:00 2018-09-27 15:00:00         klm
This looks good, but the 2018-08-09 07:30:00 row is missing. Fill in this row, and any other missing rows, with a groupby and resample:
df = df.groupby('start').resample('15min').ffill().reset_index(0, drop=True).reset_index()
Get the end15 column back; it was dropped during the earlier melt operation:
df['end15'] = df['end'].dt.floor('15min')
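Then compute the correct duration for each row. I split this into two calculations (rows whose bin is not the entry's last bin, and rows whose bin is the last one) to keep it readable: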
df.loc[df['start15'] != df['end15'], 'duration'] = np.minimum(df['end15'] - df['start'], pd.Timedelta('15min').to_timedelta64())
df.loc[df['start15'] == df['end15'], 'duration'] = np.minimum(df['end'] - df['end15'], df['end'] - df['start'])
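As a worked check of these two formulas, here is a small sketch using plain NumPy arrays for the 22-minute entry (11:48 to 12:10), which occupies the 11:45 and 12:00 bins. The array names mirror the columns above, but the arrays themselves are only illustrative.

```python
import numpy as np

# The 22-minute entry, already expanded into its two 15-minute bins.
start = np.array(['2018-01-05T11:48', '2018-01-05T11:48'], dtype='datetime64[m]')
end = np.array(['2018-01-05T12:10', '2018-01-05T12:10'], dtype='datetime64[m]')
start15 = np.array(['2018-01-05T11:45', '2018-01-05T12:00'], dtype='datetime64[m]')
end15 = np.array(['2018-01-05T12:00', '2018-01-05T12:00'], dtype='datetime64[m]')
step = np.timedelta64(15, 'm')

# Rows whose bin is not the entry's last bin get at most a full step;
# the row in the last bin gets whatever is left before the end time.
not_last = start15 != end15
duration = np.where(
    not_last,
    np.minimum(end15 - start, step),
    np.minimum(end - end15, end - start),
)
minutes = duration.astype('timedelta64[m]').astype(int)
print(minutes)
```

The first bin receives the 12 minutes from 11:48 to 12:00 and the last bin the remaining 10 minutes, matching the desired result.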
Then do some cleanup so that it looks the way you want:
df['duration'] = (df['duration'].dt.seconds/60).astype(int)
df = df[['start15', 'duration', 'Attribute_A']].copy()
print(df)
Result:
              start15  duration Attribute_A
0 2018-01-05 11:45:00        12         abc
1 2018-01-05 12:00:00        10         abc
2 2018-05-04 09:00:00         8         def
3 2018-08-09 07:15:00        15         hij
4 2018-08-09 07:30:00        15         hij
5 2018-08-09 07:45:00         5         hij
6 2018-09-27 15:00:00         2         klm
Note that some parts of this answer are based on this answer.
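For comparison, here is a minimal sketch of a fully vectorized variant that avoids groupby and apply altogether, so it does not depend on start times being unique and keeps every other column (including id). It is not taken from either answer: the function name split_into_bins is hypothetical, and it assumes that 'start' is already a datetime column and that 'duration' holds whole minutes.

```python
import numpy as np
import pandas as pd

def split_into_bins(df, freq='15min'):
    """Split each row into one row per `freq` bin that it overlaps."""
    step = pd.Timedelta(freq)
    start = df['start']
    end = start + pd.to_timedelta(df['duration'], unit='min')
    first_bin = start.dt.floor(freq)

    # Number of bins each entry touches (at least one, even for zero duration).
    n = ((end.dt.ceil(freq) - first_bin) // step).clip(lower=1).to_numpy()

    # Repeat every row n times and number the copies 0..n-1 within each entry.
    pos = np.repeat(np.arange(len(df)), n)
    k = np.arange(len(pos)) - np.repeat(np.cumsum(n) - n, n)

    out = df.iloc[pos].reset_index(drop=True)
    bin_start = first_bin.to_numpy()[pos] + k * step.to_timedelta64()
    bin_end = bin_start + step.to_timedelta64()

    # Overlap between the entry [start, end) and its bin [bin_start, bin_end).
    seg_start = np.maximum(bin_start, start.to_numpy()[pos])
    seg_end = np.minimum(bin_end, end.to_numpy()[pos])

    out['start'] = bin_start
    out['duration'] = ((seg_end - seg_start) // np.timedelta64(1, 'm')).astype(int)
    return out

testdf = pd.DataFrame({
    'start': pd.to_datetime(['2018-01-05 11:48:00', '2018-05-04 09:05:00',
                             '2018-08-09 07:15:00', '2018-09-27 15:00:00']),
    'duration': [22, 8, 35, 2],
    'Attribute_A': ['abc', 'def', 'hij', 'klm'],
    'id': [1, 2, 3, 4],
})
result = split_into_bins(testdf)
print(result)
```

Because the work is done with np.repeat and array arithmetic rather than a Python-level loop or per-group function calls, this kind of approach should scale much better to millions of rows, although that has not been measured here.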