Question

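I want to filter a DataFrame inside nested for loops to get sub-datasets, apply some_function to each sub-dataset, select one row from each result based on a duration column called TimeDiff, and then concatenate all of the selected rows into a single DataFrame.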

Here is the code:

def tm(df):
    total_t = []
    df['YearMonth'] = df['Timestamp'].apply(lambda x: x.strftime('%Y-%m'))
    
    for yearmonth in df['YearMonth'].unique():
        for id in df['Id'].unique():

            sub_df = df[(df['YearMonth'] == yearmonth) &(df['Id'] == id)]
            res_df = some_function(sub_df)
            res_df['TimeDiff'] = res_df['EndTime'] - res_df['StartTime'] 
            res_df = res_df.loc[(res_df['TimeDiff']> datetime.timedelta(seconds=60)) & (res_df['TimeDiff']<datetime.timedelta(minutes=5))]
            long_event = res_df.loc[res_df['TimeDiff'] ==res_df['TimeDiff'].max()]
            
            total_t.append(pd.Series(long_event))
            # total_t.append(pd.Series(long_event))
            total_t = pd.concat([total_t])

    return total_t

tm(dfx)

res_df is a DataFrame that looks like this:

    Id  Date        StartTime               EndTime                 StartVal EndVal TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483 2012-03-10 00:00:11.607 41.5     41.0   00:00:03.124000
1   181 2012-03-10  2012-03-10 00:02:49.687 2012-03-10 00:02:52.813 41.5     41.0   00:00:03.126000

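In each sub-dataset, I want to select the row with the longest TimeDiff among the rows whose TimeDiff is between 60 seconds and 5 minutes, and then combine those rows into a single DataFrame.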

But it raises this error:

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

I realized this might be because the DataFrames passed as the argument should be in a list, based on this question. I tried

total_t = pd.concat([total_t])
# Original code :total_t = pd.concat(total_t)

which returned:

TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

Expected output, still in this format but with more rows:

    Id  Date        StartTime               EndTime                 StartVal EndVal TimeDiff
0   89  2012-03-10  2012-03-10 00:00:08.483 2012-03-10 00:00:11.607 41.5     41.0   00:00:03.124000
1   181 2012-03-10  2012-03-10 00:02:49.687 2012-03-10 00:02:52.813 41.5     41.0   00:00:03.126000
                                            ...

Update:

I tried:

            total_t.append(long_event)
            total_t = pd.Series(total_t)
#             total_t = pd.DataFrame(pd.concat([total_t]))
            total_t = pd.concat([total_t])

which returns only one row:

0     Id   Date       StartTime               EndTime                   StartVal 
EndVal TimeDiff
37    235  2012-03-10 2012-03-10 19:43:32.260 2012-03-10 19:48:06.270   42.0 
41.5   00:04:34.010000
dtype: object

Answer 1

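IIUC, you want to return a pd.DataFrame that contains, for each YearMonth and each Id, the row with the maximum TimeDiff within 5 minutes. Is that right?

First, some comments on your code: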

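In the first iteration, at total_t.append(pd.Series(long_event)), total_t is a list, so you can append in place as usual.
The next line, total_t = pd.concat([total_t]), rebinds total_t inside the loop. On a plain list it raises the TypeError you saw; with the total_t = pd.Series(total_t) line from your update, the list is turned into a Series.
From the second iteration on, total_t.append(...) is called again, but once total_t is a pd.Series the append is no longer in place: Series.append returned a new object instead of modifying total_t, and the method was removed in pandas 2.0.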
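To make the difference concrete, here is a minimal sketch (the two one-row frames are made-up stand-ins for the per-group results, not your data). It keeps the accumulator a plain list for the whole loop and shows that wrapping that list in another list is what pd.concat rejects:

```python
import pandas as pd

# Two tiny one-row frames standing in for the per-group results (made-up data).
a = pd.DataFrame({"Id": [89], "Other": [1]})
b = pd.DataFrame({"Id": [181], "Other": [2]})

collected = []  # stays a plain list for the whole loop
for piece in (a, b):
    collected.append(piece)

# Wrapping the list in another list hands concat a list object, which it rejects.
try:
    pd.concat([collected])
    nested_ok = True
except TypeError:
    nested_ok = False

# Passing the list itself, once, after the loop, works.
combined = pd.concat(collected, ignore_index=True)
print(nested_ok, combined.shape)
```

The point is only that the conversion happens once, after the loop, and never reassigns the list while it is still being appended to.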

I suggest the following adjustments:

To get the row corresponding to the maximum TimeDiff, you can use:

long_event = res_df.loc[res_df['TimeDiff'].idxmax()]

Then you can append it just as you already do (no need to convert to pd.Series):

total_t.append(long_event)

Finally, I would remove total_t = pd.concat([total_t]) from the loop and instead add this line just before the return:

total_t = pd.DataFrame(total_t)

The final code looks like this:

            long_event = res_df.loc[res_df['TimeDiff'].idxmax()]
            total_t.append(long_event)

    total_t = pd.DataFrame(total_t)
    return total_t
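Put together, the whole function might look like the sketch below. Note the assumptions: some_function here is a hypothetical pass-through (the real one is not shown in the question), the sample dfx is invented, and the two `empty` checks are my addition, because idxmax raises on an empty selection when no row falls in the 60 seconds to 5 minutes range.

```python
import pandas as pd


def some_function(sub_df):
    # Hypothetical stand-in: the real some_function is not shown in the question.
    # It simply passes the rows through, assuming StartTime/EndTime exist.
    return sub_df.copy()


def tm(df):
    rows = []  # plain list of selected rows; never reassigned inside the loops
    df = df.copy()  # avoid mutating the caller's frame
    df["YearMonth"] = df["Timestamp"].dt.strftime("%Y-%m")

    for yearmonth in df["YearMonth"].unique():
        for id_ in df["Id"].unique():
            sub_df = df[(df["YearMonth"] == yearmonth) & (df["Id"] == id_)]
            if sub_df.empty:
                continue  # this Id has no rows in this month
            res_df = some_function(sub_df)
            res_df["TimeDiff"] = res_df["EndTime"] - res_df["StartTime"]
            in_range = res_df[
                (res_df["TimeDiff"] > pd.Timedelta(seconds=60))
                & (res_df["TimeDiff"] < pd.Timedelta(minutes=5))
            ]
            if in_range.empty:
                continue  # idxmax raises on an empty Series
            rows.append(in_range.loc[in_range["TimeDiff"].idxmax()])

    # Build the frame once, after both loops have finished.
    return pd.DataFrame(rows)


# Invented sample: Id 89 has events of 2 and 4 minutes, Id 181 one of 3 seconds.
dfx = pd.DataFrame({
    "Id": [89, 89, 181],
    "Timestamp": pd.to_datetime(
        ["2012-03-10 00:00:00", "2012-03-10 01:00:00", "2012-03-10 02:00:00"]),
    "StartTime": pd.to_datetime(
        ["2012-03-10 00:00:00", "2012-03-10 01:00:00", "2012-03-10 02:00:00"]),
    "EndTime": pd.to_datetime(
        ["2012-03-10 00:02:00", "2012-03-10 01:04:00", "2012-03-10 02:00:03"]),
})
result = tm(dfx)
print(result)
```

With this sample only the 4-minute event of Id 89 survives, since the 3-second event of Id 181 falls outside the range and that group is skipped.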

Testing with sample data

Create fake data

import datetime
import pandas as pd

df = pd.DataFrame({'TimeDiff': ['00:01:13.124000', '00:00:03.124000', '00:12:03.126000', '00:04:54.124000'], 
                   'Other': [1, 2, 3, 4]})
df['TimeDiff'] = pd.to_timedelta(df.TimeDiff)

Result

df = df.loc[(df['TimeDiff']> datetime.timedelta(seconds=60)) & (df['TimeDiff']<datetime.timedelta(minutes=5))]
total_t = []
long_event = df.loc[df['TimeDiff'].idxmax()]
total_t.append(long_event)

# after the for loops are completed in your case
total_t = pd.DataFrame(total_t)
total_t
    TimeDiff                    Other
3   0 days 00:04:54.124000000   4

The Other column is here only to show that the result is a DataFrame, so in the end it contains all of the other columns as well.
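As a side note, if the per-group output of some_function can be collected into one frame first, the nested loops could be swapped for a groupby. This is a different technique from the loop above, not a drop-in replacement, and the events frame below is invented for illustration:

```python
import pandas as pd

# Made-up events; assumes the per-group results are already in one frame.
events = pd.DataFrame({
    "YearMonth": ["2012-03", "2012-03", "2012-03", "2012-04"],
    "Id": [89, 89, 181, 89],
    "TimeDiff": pd.to_timedelta(
        ["00:01:13", "00:04:54", "00:12:03", "00:02:00"]),
})

# Keep only events longer than 60 seconds and shorter than 5 minutes.
in_range = events[
    (events["TimeDiff"] > pd.Timedelta(seconds=60))
    & (events["TimeDiff"] < pd.Timedelta(minutes=5))
]

# idxmax per (YearMonth, Id) group gives the index label of that group's longest event.
longest = in_range.loc[in_range.groupby(["YearMonth", "Id"])["TimeDiff"].idxmax()]
print(longest)
```

Groups with no event in range simply drop out at the filtering step, so no empty-group check is needed here.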
