Question

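I have been reading about best practices for avoiding iterrows when iterating over a Pandas DataFrame, but I am not sure how else to solve my particular problem:

How can I:

Find the "time" of the first instance of the value 'c' in one DataFrame, df1, grouped by "num" and sorted by "time".
Then add that "time" to a separate DataFrame, df2, based on "num".

Here is an example of my input DataFrame: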

import pandas as pd

df = pd.DataFrame({'num': [2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 
                           8, 8, 8, 8, 9, 9, 9, 9, 9], 
                   'state': ['a', 'b', 'c', 'b', 'a', 'b', 'c', 'b', 'c', 'b', 'c', 'a', 
                             'b', 'c', 'b', 'c', 'b', 'c', 'a', 'b', 'c', 'b', 'c', 'b', 
                             'c', 'b', 'c', 'b', 'c', 'b'],
                   'time': [234, 239, 244, 249, 100, 105, 110, 115, 120, 125, 130, 3, 8, 
                            13, 18, 23, 28, 33, 551, 556, 561, 566, 571, 576, 581, 45, 50, 
                            55, 60, 65]})

Expected output (df2):

Every solution I have tried seems to need iterrows to load the "time" into df2.

Answer 1

You can do it in one line, using df.groupby() with min as the aggregation function:

df[df.state == 'c'].drop('state', axis=1).groupby('num').aggregate('min')
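
To check this against the sample data and to cover the second half of the question (getting the time into df2), here is a minimal sketch. The df2 built below is a hypothetical stand-in with one row per num, since the question does not show the real one. It relies on the minimum time among the 'c' rows being the same as the first 'c' when ordered by time.

```python
import pandas as pd

df = pd.DataFrame({'num': [2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8,
                           8, 8, 8, 8, 9, 9, 9, 9, 9],
                   'state': ['a', 'b', 'c', 'b', 'a', 'b', 'c', 'b', 'c', 'b', 'c', 'a',
                             'b', 'c', 'b', 'c', 'b', 'c', 'a', 'b', 'c', 'b', 'c', 'b',
                             'c', 'b', 'c', 'b', 'c', 'b'],
                   'time': [234, 239, 244, 249, 100, 105, 110, 115, 120, 125, 130, 3, 8,
                            13, 18, 23, 28, 33, 551, 556, 561, 566, 571, 576, 581, 45, 50,
                            55, 60, 65]})

# Keep only the 'c' rows, then take the smallest time for each num.
# The result is a Series indexed by num.
first_c = df.loc[df['state'] == 'c'].groupby('num')['time'].min()

# Hypothetical df2 (assumed shape: one row per num).
df2 = pd.DataFrame({'num': [2, 5, 7, 8, 9]})

# Series.map looks each num up in the index of first_c, so no row loop is needed.
df2['time'] = df2['num'].map(first_c)
print(df2)
```

For the sample data this gives 244, 110, 13, 561 and 50 for num 2, 5, 7, 8 and 9.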

Answer 2

It is hard to check without recreating df, but I think this should do it:

def first_c(group):
    filtered = group[group['state'] == 'c'].iloc[0]
    return filtered[['num', 'time']]


df2 = df.groupby('num').apply(first_c)

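Group by num.
Apply the function, which filters for 'c' and uses iloc to take the first matching row by integer position.
Return num and time.

Note that iloc[0] takes the first 'c' row in the existing row order, so this assumes each group is already sorted by time, and it raises an IndexError for any group that has no 'c' row.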
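
If the rows might not be sorted, or some num might have no 'c' at all, a variation without apply may be easier to reason about. This is only a sketch under those assumptions: the small frame and df2 below are made up for illustration, and it swaps in sort_values, drop_duplicates and a left merge in place of the groupby/apply above.

```python
import pandas as pd

# Made-up data: rows for num 1 are out of time order, and num 3 has no 'c' row.
df = pd.DataFrame({
    'num':   [1, 1, 1, 3, 3, 4, 4],
    'state': ['a', 'c', 'c', 'a', 'b', 'c', 'b'],
    'time':  [30, 20, 10, 5, 6, 7, 8],
})

first_c = (
    df[df['state'] == 'c']       # keep only the 'c' rows
    .sort_values('time')         # earliest time first
    .drop_duplicates('num')      # keep the first (earliest) row per num
    [['num', 'time']]
)

# Hypothetical df2 with one row per num.
df2 = pd.DataFrame({'num': [1, 3, 4]})

# A left merge keeps every row of df2; nums without a 'c' get NaN for time.
df2 = df2.merge(first_c, on='num', how='left')
print(df2)
```

Because of the NaN for num 3, the time column comes back as float rather than int.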
