Thresholding a pandas data sequence based on pattern length

Time: 2019-03-06 03:01:50

Tags: python pandas data-cleaning

I have this dataframe:

    A
0   -2
1   0
2   2
3   2
4   0
5   0
6   0
7   0
8   0
9   0
10  0
11  0
12  2
13  2
14  2
15  2
16  2
17  3
18  2
19  0
20  2
21  2
22  2

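Its plot looks like this:

[image: plot of column A]

I want to threshold the data based on the length of each sequence, so that section B is flattened because its length is less than 3, as shown below:

[image: desired plot, with section B flattened]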

2 Answers:

Answer 0: (score: 1)

OK, first let's create a dataframe:

import pandas as pd

df = pd.DataFrame([-2,0,2,2,0,0,0,0,0,0,0,0,2,2,2,2,2,3,2,0,2,2,2,0,3,3,0])
df.columns = ['A']
df

As a sanity check, I appended 0, 3, 3, 0 to the end (a short run of two 3s), which gives us:

    A
0   -2
1   0
2   2
3   2
4   0
5   0
6   0
7   0
8   0
9   0
10  0
11  0
12  2
13  2
14  2
15  2
16  2
17  3
18  2
19  0
20  2
21  2
22  2
23  0
24  3
25  3
26  0

Now we have to find which elements must be set to zero:

prev = None          # index of the most recent zero
flag = 0             # 1 once a non-zero value has been seen since prev
terminationLst = []  # index lists of the short runs to flatten
for val, i in zip(df['A'], df.index):
  if val == 0 and prev is None:  # first time encountering a zero element
    prev = i
    continue
  if val != 0 and prev is not None:  # non-zero element after having seen a zero
    flag = 1
  elif val == 0:
    if flag == 1 and i - prev <= 3:  # zero after fewer than 3 consecutive non-zeros
      terminationLst.append(list(range(prev + 1, i)))
    # every zero closes the current run, whether it was short or long
    flag = 0
    prev = i
print(terminationLst)

This gives us the indices of the elements that need to be set to zero: [[2, 3], [24, 25]]

Now we just need to set them to zero:

for elem in terminationLst:
  df.loc[elem, 'A'] = 0  # single .loc assignment avoids chained indexing

The dataframe now becomes:

    A
0   -2
1   0
2   0
3   0
4   0
5   0
6   0
7   0
8   0
9   0
10  0
11  0
12  2
13  2
14  2
15  2
16  2
17  3
18  2
19  0
20  2
21  2
22  2
23  0
24  0
25  0
26  0
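As a cross-check on the loop above, the same short runs can be found with `itertools.groupby` from the standard library. This is only a minimal sketch and not part of the original answer: the helper name `short_positive_runs` and its `min_len` parameter are hypothetical, and it assumes that a "run" means consecutive values greater than zero, so the leading -2 is left alone.

```python
from itertools import groupby

def short_positive_runs(values, min_len=3):
    """Return index lists of runs of positive values shorter than min_len."""
    runs = []
    pos = 0  # position of the first element of the current group
    for is_positive, group in groupby(values, key=lambda v: v > 0):
        length = len(list(group))
        if is_positive and length < min_len:
            runs.append(list(range(pos, pos + length)))
        pos += length
    return runs

data = [-2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 3, 2,
        0, 2, 2, 2, 0, 3, 3, 0]
print(short_positive_runs(data))  # [[2, 3], [24, 25]]
```

Unlike the loop, this version also reports a short run that touches the start or end of the data, which may or may not be what you want.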

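If you have trouble understanding any particular part, leave a comment below.

Answer 1: (score: 1)

An alternative solution without a for loop (using the df from @anand_v.singh's answer):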

  1. Mask of the records above the baseline (y = 0):
    positive_mask = df>0
  1. Label each run of consecutive positive (or non-positive) values with a group number:
    sequence_groups = positive_mask.astype(int).diff(1).fillna(0).abs().cumsum().squeeze()
  1. Check the size of each sequence group:
    sequence_size = positive_mask.groupby(sequence_groups).transform(len)
  1. Put it all together (only to view the dataframe and the intermediate results side by side):
    df_extended = pd.concat([df, positive_mask, sequence_groups, sequence_size], axis=1)
    df_extended.columns = ['value', 'is_positive', 'sequence_group', 'sequence_size']
    df_extended

        value  is_positive  sequence_group  sequence_size
    0      -2        False             0.0              2
    1       0        False             0.0              2
    2       2         True             1.0              2
    3       2         True             1.0              2
    4       0        False             2.0              8
    5       0        False             2.0              8
    6       0        False             2.0              8
    7       0        False             2.0              8
    8       0        False             2.0              8
    9       0        False             2.0              8
    10      0        False             2.0              8
    11      0        False             2.0              8
    12      2         True             3.0              7
    13      2         True             3.0              7
    14      2         True             3.0              7
    15      2         True             3.0              7
    16      2         True             3.0              7
    17      3         True             3.0              7
    18      2         True             3.0              7
    19      0        False             4.0              1
    20      2         True             5.0              3
    21      2         True             5.0              3
    22      2         True             5.0              3
    23      0        False             6.0              1
    24      3         True             7.0              2
    25      3         True             7.0              2
    26      0        False             8.0              1
  1. Flatten all values that are positive and belong to a sequence shorter than 3:
    flat_mask = (df_extended.sequence_size < 3) & (df_extended.is_positive)
    df_extended.loc[flat_mask, 'value'] = 0
  1. Plot:
    df_extended.value.plot()

[image: plot of the flattened data]
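The steps above can be folded into one reusable function. The following is a minimal sketch under stated assumptions, not the answerer's code: the name `flatten_short_runs` and the `min_len` parameter are hypothetical, and the run labels are built by comparing the mask with its shifted self rather than with `diff().abs()`, which should yield the same grouping.

```python
import pandas as pd

def flatten_short_runs(series, min_len=3):
    """Zero out runs of positive values that are shorter than min_len."""
    positive = series > 0
    # A new run starts wherever the mask differs from the previous row.
    run_id = (positive != positive.shift()).cumsum()
    # Length of the run that each row belongs to.
    run_len = positive.groupby(run_id).transform("size")
    # Replace values with 0 where the row is positive and its run is short.
    return series.mask(positive & (run_len < min_len), 0)

s = pd.Series([-2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 3, 2,
               0, 2, 2, 2, 0, 3, 3, 0], name="A")
flat = flatten_short_runs(s)
print(flat.tolist())
```

Returning a new Series instead of modifying the input keeps the original data available for comparison, for example by plotting `s` and `flat` on the same axes.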