Question

I have a DataFrame with the following format:

import pandas as pd

df = pd.DataFrame({
    'A': ('foo', 'foo', 'foo', 'foo', 'foo'),
    'start': (3039, 3536, 9140, 12976, 14982),
    'end': (3536, 4879, 44331, 13641, 15643)
})

    A   start   end
0   foo 3039    3536
1   foo 3536    4879
2   foo 9140    44331
3   foo 12976   13641
4   foo 14982   15643

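How can I remove all rows whose "range", as defined by the start and end columns, overlaps with another row? In the example above, the rows at index 3 and 4 would be removed because they are contained within the row at index 2.

I tried starting with shift() to build a mask Series, but apart from not working because every value is False, it only compares each row with an adjacent row, whereas I want to compare every row's range against all the others.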

ranges_mask = ((df['start'] > df['start'].shift(-1)) & (df['end'] < df['end'].shift(-1)))
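
To illustrate the "compare against all rows" idea the question asks for, here is a minimal sketch using NumPy broadcasting. It is not from the original thread. It assumes "overlap" means full containment (as in the example), and the names inside, contained_mask and result are made up for illustration. It builds an n-by-n comparison, so it is only practical for moderately sized frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ('foo', 'foo', 'foo', 'foo', 'foo'),
    'start': (3039, 3536, 9140, 12976, 14982),
    'end': (3536, 4879, 44331, 13641, 15643),
})

start = df['start'].to_numpy()
end = df['end'].to_numpy()

# inside[i, j] is True when row i's range lies within row j's range.
# Assumption: boundaries are inclusive, so exact duplicates flag each other.
inside = (start[:, None] >= start[None, :]) & (end[:, None] <= end[None, :])
np.fill_diagonal(inside, False)  # do not compare a row with itself

# A row is dropped if it is contained in at least one other row.
contained_mask = inside.any(axis=1)
result = df[~contained_mask]
print(result)
```

On the sample data this should keep the rows at index 0, 1 and 2.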

Answer 1

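Here is one solution. It only handles the case where one interval lies entirely inside another, and because it compares each row with the one before it, it relies on the rows being sorted by start, as they are in the example: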

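df2 = df.copy()
groups = pd.Series([1] * len(df))
# Repeat until every group holds a single row.
while (groups.value_counts() > 1).any():
    # Start a new group only when both start and end exceed the previous row's;
    # otherwise the row stays in the previous row's group.
    groups = (df2['start'].gt(df2['start'].shift()) &
              df2['end'].gt(df2['end'].shift())).cumsum()
    print(groups)
    # Keep only the first row of each group.
    df2 = df2.groupby(groups, as_index=False).first()

print(df2)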

Output

0    0
1    1
2    2
3    2
4    3
dtype: int64
0    0
1    1
2    2
3    2
dtype: int64
0    0
1    1
2    2
dtype: int64
     A  start    end
0  foo   3039   3536
1  foo   3536   4879
2  foo   9140  44331
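
As a possible alternative to the loop, the same containment check can be sketched without iteration by using a running maximum of end. This is an assumption-laden sketch, not part of the original answer: it sorts by start (ties broken by the larger end first), treats boundaries as inclusive, and the names ordered, prev_max_end, contained and deduped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ('foo', 'foo', 'foo', 'foo', 'foo'),
    'start': (3039, 3536, 9140, 12976, 14982),
    'end': (3536, 4879, 44331, 13641, 15643),
})

# Sort so that any range that could contain the current one comes earlier.
ordered = df.sort_values(['start', 'end'], ascending=[True, False])

# Largest end seen in all earlier rows (NaN for the first row).
prev_max_end = ordered['end'].cummax().shift()

# Every earlier row starts at or before this one, so if some earlier end
# reaches at least this row's end, this row is contained in that range.
contained = ordered['end'] <= prev_max_end

deduped = ordered[~contained].sort_index()
print(deduped)
```

On the sample data this should leave the same three rows as the loop above. Comparing against the running maximum rather than only the previous row is what lets row 4 be detected as lying inside row 2 in a single pass.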
