I have a dataframe that looks like this:
START_TIME END_TIME TRIAL_No itemnr
2403950 2413067 Trial: 1 P14
2413378 2422499 Trial: 2 P03
2422814 2431931 Trial: 3 P13
2432246 2441363 Trial: 4 P02
2523540 2541257 Trial: 5 P11
2541864 2560297 Trial: 6 P10
2560916 2577249 Trial: 7 P05
The table continues like this. START_TIME and END_TIME are in milliseconds and give the start and end time of each trial. What I want to do is resample START_TIME into 100 ms time bins and fill in the variables (TRIAL_No and itemnr) between each START_TIME and END_TIME; outside those intervals the variables should be NA. For example, in the first row START_TIME is 2403950 and END_TIME is 2413067. The difference between them is 9117 ms, so "Trial: 1" lasts for 9117 ms. Because the bins are 100 ms apart, that corresponds to about 91 bins, so in the resulting dataframe "Trial: 1" and "P14" should be repeated 91 times. The same goes for the rest. The result should look like this:
Bin_time TRIAL_No itemnr
2403950 Trial: 1 P14
2404050 Trial: 1 P14
2404150 Trial: 1 P14
...
2413050 Trial: 1 P14
2413150 Trial: 2 P03
2413250 Trial: 2 P03
And so on. I'm not sure whether this can be done directly in pandas or whether some preprocessing is needed.
Answer 0 (score: 1)
After creating a new dataframe with concat, you can group it by row and apply resample on each group (forward-filling with the ffill method).
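For reference, a minimal sketch that rebuilds the sample frame from the question (the values are taken from the table above; in practice the data would presumably be loaded from a file):

import pandas as pd
import numpy as np  # used later for the timedelta-to-integer conversion

df = pd.DataFrame({
    'START_TIME': [2403950, 2413378, 2422814, 2432246, 2523540, 2541864, 2560916],
    'END_TIME':   [2413067, 2422499, 2431931, 2441363, 2541257, 2560297, 2577249],
    'TRIAL_No':   ['Trial: 1', 'Trial: 2', 'Trial: 3', 'Trial: 4',
                   'Trial: 5', 'Trial: 6', 'Trial: 7'],
    'itemnr':     ['P14', 'P03', 'P13', 'P02', 'P11', 'P10', 'P05']
})
#keep the column order shown in the question
df = df[['START_TIME', 'END_TIME', 'TRIAL_No', 'itemnr']]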
print df
# START_TIME END_TIME TRIAL_No itemnr
#0 2403950 2413067 Trial: 1 P14
#1 2413378 2422499 Trial: 2 P03
#2 2422814 2431931 Trial: 3 P13
#3 2432246 2441363 Trial: 4 P02
#4 2523540 2541257 Trial: 5 P11
#5 2541864 2560297 Trial: 6 P10
#6 2560916 2577249 Trial: 7 P05
#PREPROCESSING
#helper column for matching start and end rows
df['row'] = range(len(df))
#reshape the dataframe - repeat every row twice, once for START_TIME and once for END_TIME
starts = df[['START_TIME','TRIAL_No','itemnr','row']].rename(columns={'START_TIME':'Bin_time'})
ends = df[['END_TIME','TRIAL_No','itemnr','row']].rename(columns={'END_TIME':'Bin_time'})
df = pd.concat([starts, ends])
df = df.set_index('row', drop=True)
df = df.sort_index()
#convert milliseconds to timedelta for resampling by 100ms
df['Bin_time'] = df['Bin_time'].astype('timedelta64[ms]')
print df
# Bin_time TRIAL_No itemnr
#row
#0 00:40:03.950000 Trial: 1 P14
#0 00:40:13.067000 Trial: 1 P14
#1 00:40:13.378000 Trial: 2 P03
#1 00:40:22.499000 Trial: 2 P03
#2 00:40:22.814000 Trial: 3 P13
#2 00:40:31.931000 Trial: 3 P13
#3 00:40:32.246000 Trial: 4 P02
#3 00:40:41.363000 Trial: 4 P02
#4 00:42:03.540000 Trial: 5 P11
#4 00:42:21.257000 Trial: 5 P11
#5 00:42:21.864000 Trial: 6 P10
#5 00:42:40.297000 Trial: 6 P10
#6 00:42:40.916000 Trial: 7 P05
#6 00:42:57.249000 Trial: 7 P05
print df.dtypes
#Bin_time timedelta64[ms]
#TRIAL_No object
#itemnr object
#dtype: object
#resample and fill missing data
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms',how='first',fill_method='ffill'))
df = df.reset_index()
df = df.drop(['row'], axis=1)
#convert timedelta back to integer (milliseconds)
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)
print df.head()
# Bin_time TRIAL_No itemnr
#0 2403950 Trial: 1 P14
#1 2404050 Trial: 1 P14
#2 2404150 Trial: 1 P14
#3 2404250 Trial: 1 P14
#4 2404350 Trial: 1 P14
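Note that newer pandas versions no longer accept the how= and fill_method= arguments of resample, so the snippet above only runs on older releases. A rough equivalent of the resampling step on a recent pandas (a sketch, assuming Bin_time has already been converted to timedelta as above) chains .first() and .ffill() instead:

#resample and fill missing data with the current resample API:
#.first() takes the value per 100ms bin, .ffill() forward-fills within each trial
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())
df = df.reset_index()
df = df.drop(['row'], axis=1)
#convert timedelta back to integer (milliseconds)
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)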
EDIT:
If you want NaN outside the groups, you can change the code after the groupby:
#resample and fill missing data
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms', how='first',fill_method='ffill'))
#reset only the first level - drop the helper row index
df = df.reset_index(level=0, drop=True)
#resample the whole frame by 100ms; bins outside the trials stay NaN
df = df.resample('100ms', how='first')
df = df.reset_index()
#convert timedelta back to integer (milliseconds)
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)
print df
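On a recent pandas the edited variant looks like this (again only a sketch using the current resample API, with the same preprocessing as above; the behaviour should match the version shown):

#resample within each trial and forward-fill
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())
#reset only the first level - drop the helper row index
df = df.reset_index(level=0, drop=True)
#resample the whole timeline by 100ms; bins outside the trials stay NaN
df = df.resample('100ms').first()
df = df.reset_index()
#convert timedelta back to integer (milliseconds)
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)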