Question

I have the following data structure:

     |a       |b     |start_time  |end_time
0    |aaba    |d     |11:26       | 11:27
1    |aba     |c     |11:27       | 11:32
2    |aba     |c     |11:32       | 11:34
3    |cab     |ab    |11:34       | 11:35
4    |aba     |c     |11:35       | 11:40

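I want to merge consecutive rows that are duplicates on columns a and b, and then set the merged row's start_time to the earlier of the two start times and its end_time to the later of the two end times.

Because the entries are consecutive, this amounts to keeping the first row's start_time and the second row's end_time. A run of duplicates usually contains just two rows.

So in the case above, I want to merge rows 1 and 2 and end up with: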

     |a       |b     |start_time  |end_time
0    |aaba    |d     |11:26       | 11:27
1    |aba     |c     |11:27       | 11:34
2    |cab     |ab    |11:34       | 11:35
3    |aba     |c     |11:35       | 11:40

I tried using loc, updating the end_time column in a first pass and dropping the duplicates in a second pass, but running loc twice seems wasteful:

df.loc[(df['a']+df['b']) == (df['a']+df['b']).shift(-1), 'end_time'] = df['end_time'].shift(-1)

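# Compare with the previous row (shift()), so the first row of each run, whose end_time was just updated, is the one kept.
df = df.loc[(df['a']+df['b']) != (df['a']+df['b']).shift()]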

Is there a way to drop the duplicates and update the end_time values in a single pass?

Answer 1

Use groupby on a, b, and a group ID that marks consecutive runs of b, passing as_index=False. For each group, agg the minimum start_time and the maximum end_time:

df.groupby(['a','b', df.b.ne(df.b.shift()).cumsum()], as_index=False).agg({'start_time': 'min', 'end_time': 'max'})

Out[1649]:
      a   b start_time end_time
0  aaba   d      11:26    11:27
1   aba   c      11:27    11:34
2   aba   c      11:35    11:40
3   cab  ab      11:34    11:35
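
Note that the output above is sorted by the group keys, so the rows come back in a different order than the desired result, and the run ID is built from b alone. As a hedged variation on the same idea (a sketch, not part of the original answer), the run ID can be derived from both a and b and used as the only group key, which keeps the original row order. The names keys, new_run, run_id and merged are illustrative, and the sample frame is copied from the question. Following the question's own reasoning, it takes the first start_time and the last end_time of each run instead of min and max.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["aaba", "aba", "aba", "cab", "aba"],
    "b": ["d", "c", "c", "ab", "c"],
    "start_time": ["11:26", "11:27", "11:32", "11:34", "11:35"],
    "end_time": ["11:27", "11:32", "11:34", "11:35", "11:40"],
})

# A row starts a new run when a or b differs from the previous row.
keys = df[["a", "b"]]
new_run = keys.ne(keys.shift()).any(axis=1)

# The cumulative sum of the "new run" flags gives one ID per consecutive run.
run_id = new_run.cumsum().rename("run_id")

# One output row per run: first a, b and start_time, last end_time.
merged = (
    df.groupby(run_id, sort=False)
      .agg(
          a=("a", "first"),
          b=("b", "first"),
          start_time=("start_time", "first"),
          end_time=("end_time", "last"),
      )
      .reset_index(drop=True)
)

print(merged)
```

Because the times are already ordered within a run, first and last give the same values as min and max here. If that ordering is not guaranteed, 'min' and 'max' can be substituted in the agg call.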

How do I remove consecutive duplicate rows from a pandas DataFrame while updating column values?
