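I want to filter a DataFrame inside nested for loops to get sub-datasets, apply some_function to each sub-dataset, select one row from each sub-dataset based on a duration column called TimeDiff, and then concatenate all of those individual rows into a single DataFrame.
Here is the code: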
import datetime
import pandas as pd

def tm(df):
    total_t = []
    df['YearMonth'] = df['Timestamp'].apply(lambda x: x.strftime('%Y-%m'))
    for yearmonth in df['YearMonth'].unique():
        for id in df['Id'].unique():
            sub_df = df[(df['YearMonth'] == yearmonth) & (df['Id'] == id)]
            res_df = some_function(sub_df)
            res_df['TimeDiff'] = res_df['EndTime'] - res_df['StartTime']
            res_df = res_df.loc[(res_df['TimeDiff'] > datetime.timedelta(seconds=60)) & (res_df['TimeDiff'] < datetime.timedelta(minutes=5))]
            long_event = res_df.loc[res_df['TimeDiff'] == res_df['TimeDiff'].max()]
            total_t.append(pd.Series(long_event))
            # total_t.append(pd.Series(long_event))
            total_t = pd.concat([total_t])
    return total_t

tm(dfx)
res_df is a DataFrame that looks like this:
    Id        Date                StartTime                  EndTime  StartVal  EndVal         TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483  2012-03-10 00:00:11.607      41.5    41.0  00:00:03.124000
1  181  2012-03-10  2012-03-10 00:02:49.687  2012-03-10 00:02:52.813      41.5    41.0  00:00:03.126000
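In each sub-dataset, I want to select the row with the longest TimeDiff that also falls between 60 seconds and 5 minutes, and then combine those rows into one DataFrame.
But it raises this error: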
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
I realize this is probably because, according to this question, the DataFrames passed as the argument should be in the form of a list. I tried:
total_t = pd.concat([total_t])
# Original code :total_t = pd.concat(total_t)
which returned:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
The expected output is still in this format, but with more rows:
    Id        Date                StartTime                  EndTime  StartVal  EndVal         TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483  2012-03-10 00:00:11.607      41.5    41.0  00:00:03.124000
1  181  2012-03-10  2012-03-10 00:02:49.687  2012-03-10 00:02:52.813      41.5    41.0  00:00:03.126000
...
Update:
I tried:
total_t.append(long_event)
total_t = pd.Series(total_t)
# total_t = pd.DataFrame(pd.concat([total_t]))
total_t = pd.concat([total_t])
which returns only one row:
0    Id        Date                StartTime                  EndTime  StartVal  EndVal         TimeDiff
37  235  2012-03-10  2012-03-10 19:43:32.260  2012-03-10 19:48:06.270      42.0    41.5  00:04:34.010000
dtype: object
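Answer 0 (score: 0)
IIUC, you want to return a pd.DataFrame that contains, for each YearMonth and each Id, the row with the largest TimeDiff that is under 5 minutes. Is that right?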
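First, some comments on your code:
In the first iteration, total_t is a list, so total_t.append(pd.Series(long_event)) appends in place as usual.
total_t = pd.concat([total_t]) is meant to turn your list into a Series, but once total_t has become a pd.Series, total_t.append(pd.Series(long_event)) no longer appends in place.
I suggest the following adjustments: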
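To make the two TypeErrors from the question concrete, here is a minimal sketch of what pd.concat accepts and rejects. The names rows, nested_error and bare_error and the two sample rows are made up for illustration; the exact error wording may differ between pandas versions.

```python
import pandas as pd

rows = []  # a plain Python list
rows.append(pd.Series({"Id": 89, "StartVal": 41.5}))   # list.append mutates the list in place
rows.append(pd.Series({"Id": 181, "StartVal": 41.5}))

# pd.concat expects an iterable whose elements are Series or DataFrame objects.
try:
    pd.concat([rows])  # a list wrapped in a list: the single element is a list
    nested_error = None
except TypeError as exc:
    nested_error = exc

try:
    pd.concat(pd.DataFrame(rows))  # a bare DataFrame instead of a list of pandas objects
    bare_error = None
except TypeError as exc:
    bare_error = exc

# Building the frame directly from the list of row Series avoids both problems.
combined = pd.DataFrame(rows)
```

Both calls to pd.concat fail with a TypeError, while pd.DataFrame(rows) yields a two-row frame.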
To get the row with the maximum TimeDiff, you can use:
long_event = res_df.loc[res_df['TimeDiff'].idxmax()]
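As a rough illustration of why this differs from the comparison against .max() in the question: the boolean mask returns a DataFrame, while .loc with idxmax() returns a single row as a Series. The sample values below are invented, and the .empty guard is my own addition, on the assumption that the 60-second-to-5-minute filter can leave nothing to pick from.

```python
import pandas as pd

res_df = pd.DataFrame({
    "Id": [89, 181, 235],
    "TimeDiff": pd.to_timedelta(["00:01:13", "00:04:34", "00:02:00"]),
})

# Boolean mask: the result is a DataFrame (one row here, more if the maximum is tied).
as_frame = res_df.loc[res_df["TimeDiff"] == res_df["TimeDiff"].max()]

# idxmax() gives the index label of the maximum; .loc with that label gives one row as a Series.
as_row = res_df.loc[res_df["TimeDiff"].idxmax()]

# Guard against an empty frame before asking for idxmax().
empty = res_df.iloc[0:0]
safe = None if empty.empty else empty.loc[empty["TimeDiff"].idxmax()]
```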
Then you can append it as you already do (there is no need to convert it to pd.Series):
total_t.append(long_event)
Finally, I would remove total_t = pd.concat([total_t]) and instead add this line just before returning:
total_t = pd.DataFrame(total_t)
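For comparison, here is a small sketch of the two ways a list of rows can become one DataFrame: pd.DataFrame on a list of Series, or pd.concat on a list of one-row DataFrames. The variable names and the two sample rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "TimeDiff": pd.to_timedelta(["00:01:13", "00:04:54"]),
    "Other": [1, 4],
})

# Each item is a Series holding one row: pd.DataFrame stacks them as rows.
series_rows = [df.loc[0], df.loc[1]]
from_series = pd.DataFrame(series_rows)

# Each item is a one-row DataFrame: pd.concat takes the list itself, not a list wrapped in a list.
frame_rows = [df.loc[[0]], df.loc[[1]]]
from_frames = pd.concat(frame_rows)
```

Either route keeps the original index labels and all columns; which one fits depends on whether the loop collects Series or DataFrames.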
The final code looks like this:
            long_event = res_df.loc[res_df['TimeDiff'].idxmax()]  # inside the inner loop
            total_t.append(long_event)
    total_t = pd.DataFrame(total_t)  # after both loops have finished
    return total_t
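Putting the pieces together, the whole function might look roughly like the sketch below. This is not a definitive version: some_function is a hypothetical stand-in (the real one is not shown in the question), the dfx sample data is invented, and the two .empty checks are my own assumption that some (YearMonth, Id) pairs have no rows or no event in the 60-second-to-5-minute range.

```python
import datetime
import pandas as pd

def some_function(sub_df):
    # Hypothetical stand-in for the question's some_function: passes the events through.
    return sub_df[["Id", "StartTime", "EndTime"]].copy()

def tm(df):
    rows = []
    df = df.copy()  # avoid adding the helper column to the caller's frame
    df["YearMonth"] = df["Timestamp"].dt.strftime("%Y-%m")
    for yearmonth in df["YearMonth"].unique():
        for id_ in df["Id"].unique():
            sub_df = df[(df["YearMonth"] == yearmonth) & (df["Id"] == id_)]
            if sub_df.empty:  # this Id has no rows in this month
                continue
            res_df = some_function(sub_df)
            res_df["TimeDiff"] = res_df["EndTime"] - res_df["StartTime"]
            in_range = res_df[(res_df["TimeDiff"] > datetime.timedelta(seconds=60))
                              & (res_df["TimeDiff"] < datetime.timedelta(minutes=5))]
            if in_range.empty:  # nothing between 60 seconds and 5 minutes
                continue
            rows.append(in_range.loc[in_range["TimeDiff"].idxmax()])
    return pd.DataFrame(rows)  # built once, after both loops

# Invented sample data, only to exercise the sketch.
dfx = pd.DataFrame({
    "Id": [89, 89, 181, 181],
    "Timestamp": pd.to_datetime(["2012-03-10", "2012-03-10", "2012-03-10", "2012-04-01"]),
    "StartTime": pd.to_datetime(["2012-03-10 00:00:00", "2012-03-10 01:00:00",
                                 "2012-03-10 02:00:00", "2012-04-01 00:00:00"]),
    "EndTime": pd.to_datetime(["2012-03-10 00:01:30", "2012-03-10 01:04:00",
                               "2012-03-10 02:00:03", "2012-04-01 00:02:00"]),
})
result = tm(dfx)
```

With this sample, the 3-second event for Id 181 in March is filtered out, so result holds one row for Id 89 in March and one for Id 181 in April.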
Testing with sample data
Create fake data
import datetime
import pandas as pd

df = pd.DataFrame({'TimeDiff': ['00:01:13.124000', '00:00:03.124000', '00:12:03.126000', '00:04:54.124000'],
                   'Other': [1, 2, 3, 4]})
df['TimeDiff'] = pd.to_timedelta(df.TimeDiff)
Result
df = df.loc[(df['TimeDiff']> datetime.timedelta(seconds=60)) & (df['TimeDiff']<datetime.timedelta(minutes=5))]
total_t = []
long_event = df.loc[df['TimeDiff'].idxmax()]
total_t.append(long_event)
# after the for loops are completed in your case
total_t = pd.DataFrame(total_t)
total_t
                   TimeDiff  Other
3 0 days 00:04:54.124000000      4
The Other column is there only to show that the result is a DataFrame, so all the other columns are carried through to the end.