Filtering sub-DataFrames in a nested for loop

Time: 2020-06-26 06:34:14

Tags: python pandas dataframe for-loop nested

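I want to filter a DataFrame inside a nested for loop to get sub-datasets, apply some_function to each sub-dataset, select one row from each sub-dataset based on a duration column called TimeDiff, and then concatenate all of the selected rows into a single DataFrame.

Here is the code: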

def tm(df):
    total_t = []
    df['YearMonth'] = df['Timestamp'].apply(lambda x: x.strftime('%Y-%m'))
    
    for yearmonth in df['YearMonth'].unique():
        for id in df['Id'].unique():

            sub_df = df[(df['YearMonth'] == yearmonth) &(df['Id'] == id)]
            res_df = some_function(sub_df)
            res_df['TimeDiff'] = res_df['EndTime'] - res_df['StartTime'] 
            res_df = res_df.loc[(res_df['TimeDiff']> datetime.timedelta(seconds=60)) & (res_df['TimeDiff']<datetime.timedelta(minutes=5))]
            long_event = res_df.loc[res_df['TimeDiff'] ==res_df['TimeDiff'].max()]
            
            total_t.append(pd.Series(long_event))
            # total_t.append(pd.Series(long_event))
            total_t = pd.concat([total_t])

    return total_t

tm(dfx)

res_df is a DataFrame that looks like this:

    Id  Date        StartTime               EndTime                 StartVal EndVal TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483 2012-03-10 00:00:11.607 41.5     41.0   00:00:03.124000
1   181 2012-03-10  2012-03-10 00:02:49.687 2012-03-10 00:02:52.813 41.5     41.0   00:00:03.126000

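In each sub-dataset I want to select the row with the longest TimeDiff that falls between 60 seconds and 5 minutes, and then combine those rows into one DataFrame.

But it raises this error: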

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

I realized this may be because, according to this question, the DataFrames passed as the argument have to be in a list. I tried

total_t = pd.concat([total_t])
# Original code :total_t = pd.concat(total_t) 

which returned:

TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

Expected output, in the same format but with more rows:

    Id  Date        StartTime               EndTime                 StartVal EndVal TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483 2012-03-10 00:00:11.607 41.5     41.0   00:00:03.124000
1   181 2012-03-10  2012-03-10 00:02:49.687 2012-03-10 00:02:52.813 41.5     41.0   00:00:03.126000
                                            ...

Update:

I tried:

            total_t.append(long_event)
            total_t = pd.Series(total_t)
#             total_t = pd.DataFrame(pd.concat([total_t]))
            total_t = pd.concat([total_t])

which returns only one row:

0     Id   Date       StartTime               EndTime                   StartVal 
EndVal TimeDiff
37    235  2012-03-10 2012-03-10 19:43:32.260 2012-03-10 19:48:06.270   42.0 
41.5   00:04:34.010000
dtype: object

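1 Answer:

Answer 0 (score: 0)

IIUC you want to return a pd.DataFrame that contains, for each YearMonth and each Id, the row with the maximum TimeDiff under 5 minutes. Is that right?

First, some comments on your code:

  • In the first iteration, total_t is a list when total_t.append(pd.Series(long_event)) runs, so the append happens in place as usual.
  • The next line, total_t = pd.concat([total_t]), converts your list into a Series.
  • From the second iteration on, total_t.append(pd.Series(long_event)) is executed again, but since total_t has become a pd.Series, the append no longer works in place.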
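
The rebinding problem described above can be reproduced in isolation. This is a minimal sketch, assuming the original `pd.concat(total_t)` form mentioned in the question; the two single-row frames `first` and `second` are made up for illustration. It shows that after the first concatenation the name no longer refers to a list, and that a second `pd.concat` on it raises a TypeError.

```python
import pandas as pd

# Hypothetical single-row frames standing in for long_event.
first = pd.DataFrame({'Id': [89], 'Other': [1]})
second = pd.DataFrame({'Id': [181], 'Other': [2]})

total_t = [first]
total_t = pd.concat(total_t)  # the name now points at a DataFrame, not a list
became_frame = isinstance(total_t, pd.DataFrame)

# Passing a bare DataFrame (instead of a list of them) to pd.concat fails.
try:
    pd.concat(total_t)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Keeping the list intact and concatenating once, after the loop, avoids this.
rows = [first, second]
combined = pd.concat(rows, ignore_index=True)
```

Keeping one name for the list and a different one for the final result makes this kind of accidental rebinding harder to introduce.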

I suggest the following adjustments:

To get the row that corresponds to the maximum TimeDiff, you can use:

long_event = res_df.loc[res_df['TimeDiff'].idxmax()]
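
To illustrate what this line returns, here is a small sketch on made-up data (the values and the index labels 10, 20, 30 are assumptions for the example only). `idxmax()` gives the index label of the largest value, so `.loc` with that label yields a single row as a Series, whereas the boolean-mask version from the question yields a DataFrame.

```python
import pandas as pd

# Made-up data with non-default index labels to make the label lookup visible.
res_df = pd.DataFrame(
    {
        'TimeDiff': pd.to_timedelta(['00:01:13', '00:04:54', '00:02:00']),
        'Other': [1, 2, 3],
    },
    index=[10, 20, 30],
)

label = res_df['TimeDiff'].idxmax()  # index label of the largest TimeDiff
as_series = res_df.loc[label]        # one row, returned as a Series

# The question's approach: a boolean mask, which returns a DataFrame.
as_frame = res_df.loc[res_df['TimeDiff'] == res_df['TimeDiff'].max()]
```

Note that `idxmax()` raises on an empty Series, so if the 60 seconds to 5 minutes filter can remove every row of a sub-dataset, it may be worth checking `res_df.empty` first.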

Then you can append them as you already do (no need to convert to pd.Series):

total_t.append(long_event)

Finally, I would remove total_t = pd.concat([total_t]) and instead add the following just before the return:

total_t = pd.DataFrame(total_t)

The final code looks like this:

            long_event = res_df.loc[res_df['TimeDiff'].idxmax()]
            total_t.append(long_event)

    total_t = pd.DataFrame(total_t)
    return total_t
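
For reference, the fragment above can be assembled into a complete function roughly as follows. This is only a sketch: the real some_function is not shown in the question, so a hypothetical stand-in is used here, the sample data is invented, and the two `empty` checks are an added assumption to skip (YearMonth, Id) combinations that have no rows in range.

```python
import datetime
import pandas as pd

def some_function(sub_df):
    # Hypothetical stand-in: the question does not show the real some_function.
    return sub_df[['Id', 'StartTime', 'EndTime']].copy()

def tm(df):
    rows = []  # stays a plain list for the whole loop
    df = df.copy()
    df['YearMonth'] = df['Timestamp'].dt.strftime('%Y-%m')
    lower = datetime.timedelta(seconds=60)
    upper = datetime.timedelta(minutes=5)

    for yearmonth in df['YearMonth'].unique():
        for id_ in df['Id'].unique():
            sub_df = df[(df['YearMonth'] == yearmonth) & (df['Id'] == id_)]
            if sub_df.empty:
                continue  # this Id has no rows in this month
            res_df = some_function(sub_df)
            res_df['TimeDiff'] = res_df['EndTime'] - res_df['StartTime']
            res_df = res_df.loc[(res_df['TimeDiff'] > lower) & (res_df['TimeDiff'] < upper)]
            if res_df.empty:
                continue  # nothing between 60 seconds and 5 minutes
            rows.append(res_df.loc[res_df['TimeDiff'].idxmax()])

    # Build the DataFrame once, after both loops have finished.
    return pd.DataFrame(rows)

# Invented sample data: TimeDiff values are 1:30, 4:00, 0:03 and 2:00.
dfx = pd.DataFrame({
    'Id': [89, 89, 181, 181],
    'Timestamp': pd.to_datetime(['2012-03-10', '2012-03-10', '2012-03-10', '2012-04-02']),
    'StartTime': pd.to_datetime(['2012-03-10 00:00:00', '2012-03-10 01:00:00',
                                 '2012-03-10 02:00:00', '2012-04-02 00:00:00']),
    'EndTime': pd.to_datetime(['2012-03-10 00:01:30', '2012-03-10 01:04:00',
                               '2012-03-10 02:00:03', '2012-04-02 00:02:00']),
})
result = tm(dfx)
```

With this sample, the result should hold one row per (YearMonth, Id) pair that has an event in range, keeping the original index labels of the selected rows.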

Testing with sample data

Create fake data

df = pd.DataFrame({'TimeDiff': ['00:01:13.124000', '00:00:03.124000', '00:12:03.126000', '00:04:54.124000'], 
                   'Other': [1, 2, 3, 4]})
df['TimeDiff'] = pd.to_timedelta(df.TimeDiff)

Result

df = df.loc[(df['TimeDiff']> datetime.timedelta(seconds=60)) & (df['TimeDiff']<datetime.timedelta(minutes=5))]
total_t = []
long_event = df.loc[df['TimeDiff'].idxmax()]
total_t.append(long_event)

# after the for loops are completed in your case
total_t = pd.DataFrame(total_t)
total_t
    TimeDiff                    Other
3   0 days 00:04:54.124000000   4

The Other column is only there to show that the result is a DataFrame, so all of the other columns are carried along in the final output.
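
As a possible alternative to the nested loops, the same selection can be sketched with groupby. This assumes the output of some_function for all groups is already available in one frame with YearMonth, Id, and TimeDiff columns, which the question does not confirm; the function name longest_event_per_group and the sample values are hypothetical.

```python
import pandas as pd

def longest_event_per_group(events,
                            lower=pd.Timedelta(seconds=60),
                            upper=pd.Timedelta(minutes=5)):
    # Keep only events strictly between the two bounds.
    in_range = events[(events['TimeDiff'] > lower) & (events['TimeDiff'] < upper)]
    # After sorting by TimeDiff, the last row of each group is its longest event.
    longest = in_range.sort_values('TimeDiff').groupby(['YearMonth', 'Id']).tail(1)
    return longest.sort_index()

# Invented sample data.
events = pd.DataFrame({
    'YearMonth': ['2012-03', '2012-03', '2012-03', '2012-04'],
    'Id': [89, 89, 181, 181],
    'TimeDiff': pd.to_timedelta(['00:01:30', '00:04:00', '00:00:03', '00:02:00']),
})
per_group = longest_event_per_group(events)
```

Because groupby only visits combinations that actually occur in the data, no explicit check for empty sub-datasets is needed in this variant.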
