Question

I have a pandas DataFrame. For example:

        Date   Time    A     B   C 
0   1.1.2015  00:00    2    16  50  
1   1.1.2015  01:00    2     9  50   
2   1.1.2015  02:00    4     6  50   
3   1.1.2015  03:00    3     7  31  
4   1.1.2015  04:00    2     7  42    
5   1.1.2015  05:00    2     7  22    
6   1.1.2015  06:00    2     7  14  
7   1.1.2015  07:00    2    11  50    
8   1.1.2015  08:00    3    11  28   
9   1.1.2015  09:00    2    18  17

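I want to remove numbers that repeat consecutively more than 3 times, keeping only the first one. I need to remove:

1- Rows 5, 6 and 7, because column A has four 2s in a row and I don't need the last three.

2- Rows 4, 5 and 6, because column B has four 7s in a row.

3- Rows 1 and 2, because column C has three 50s in a row.

So the output I want looks like this: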

           Date   Time    A     B   C 
0      1.1.2015  00:00    2    16  50  
1      1.1.2015  03:00    3     7  31 
2      1.1.2015  08:00    3    11  28   
3      1.1.2015  09:00    2    18  17

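I searched for similar questions, and the closest one I found is "Removing values that repeat more than 5 times in Pandas DataFrame". I tried to adapt it to my problem but could not (I am a beginner in Python). Can anyone help me?

Thanks.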

Answer 1

You can use itertools to help:

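import itertools
import numpy as np

def f(serie):
    # Build a boolean mask: within each run of 3 or more equal consecutive
    # values, keep only the first element.
    xs = []
    for el, gr in itertools.groupby(serie):
        x = np.repeat(True, len(list(gr)))
        if len(x) >= 3:
            x[1:] = False
        xs.append(x)
    return np.concatenate(xs)

# Keep only the rows that pass the mask in every one of the columns A, B and C.
df[df[['A','B','C']].apply(f, axis=0).apply(np.all, axis=1)]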

#Out[64]:
#       Date   Time  A   B   C
#0  1.1.2015  00:00  2  16  50
#3  1.1.2015  03:00  3   7  31
#8  1.1.2015  08:00  3  11  28
#9  1.1.2015  09:00  2  18  17

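The idea is to use the utility function f to count the consecutive elements in a column and build the corresponding boolean mask; for example, you can inspect the result of f(df['A']). These boolean masks are then aggregated with np.all to filter the original DataFrame.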
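To make the mask idea concrete, here is a minimal sketch that rebuilds the numeric columns of the question's sample data and prints the mask for column A together with the rows that survive. The DataFrame literal, the names mask_a and kept_rows, and the use of a plain np.all over the stacked per-column masks (rather than the chained apply calls) are assumptions made for illustration; f is the helper from this answer.

```python
import itertools

import numpy as np
import pandas as pd

# Sample data copied from the question (Date/Time columns omitted for brevity).
df = pd.DataFrame({
    "A": [2, 2, 4, 3, 2, 2, 2, 2, 3, 2],
    "B": [16, 9, 6, 7, 7, 7, 7, 11, 11, 18],
    "C": [50, 50, 50, 31, 42, 22, 14, 50, 28, 17],
})

def f(serie):
    # Within each run of 3 or more equal consecutive values,
    # mark only the first element as True.
    xs = []
    for _, gr in itertools.groupby(serie):
        x = np.repeat(True, len(list(gr)))
        if len(x) >= 3:
            x[1:] = False
        xs.append(x)
    return np.concatenate(xs)

# Mask for a single column: rows 5, 6 and 7 are False (the trailing 2s).
mask_a = f(df["A"])

# A row is kept only if every column's mask is True for that row.
mask = np.all([f(df[col]) for col in ["A", "B", "C"]], axis=0)
kept_rows = df.index[mask].tolist()

print(mask_a)
print(kept_rows)  # [0, 3, 8, 9]
```

The surviving index labels match the output shown above. Note that a run counts as repeated once it has 3 or more elements, which is what the three 50s in column C require.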

Answer 2

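from collections import Counter

def remover(x=None):
    # Count how often each value occurs anywhere in x. This counts total
    # occurrences, not consecutive runs, so it will not by itself reproduce
    # the output requested in the question.
    counts = Counter(x if x is not None else [])
    # Keep the values that occur fewer than 3 times, in first-seen order.
    new = [val for val, n in counts.items() if n < 3]
    print(new)
    return new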
