Counting the number of consecutive missing values

Asked: 2020-02-24 15:32:01

Tags: python

I'm trying to find a way to count how many values have been randomly removed from a dataframe, and how many of those removals occur consecutively, one right after another.

My code so far:

import numpy as np
import pandas as pd

# Sample data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

df = pd.DataFrame({'col_1': y, 'col_2': x})

# Drop 5 randomly chosen rows; no seed is set, so the dropped rows
# differ between runs
drop_indices = np.random.choice(df.index, 5, replace=False)
df_subset = df.drop(drop_indices)

print(df_subset)
print(df)

This drops 5 random rows from the dataframe and gives the output:

   col_1  col_2
0      1      1
1      2      2
2      3      3
5      6      6
8      9      9
   col_1  col_2
0      1      1
1      2      2
2      3      3
3      4      4
4      5      5
5      6      6
6      7      7
7      8      8
8      9      9
9     10     10

I'd like to transform this into the following dataframe, where N_removedvalues is the running total of removed values and N_consecutive numbers each position within a run of consecutively removed values (resetting to 0 wherever a value is still present):

   col_1  col_2  col_2  N_removedvalues  N_consecutive
0      1      1      1                0              0
1      2      2      2                0              0
2      3      3      3                0              0
3      4      4                       1              1
4      5      5                       2              2
5      6      6      6                2              0
6      7      7                       3              1
7      8      8                       4              2
8      9      9      9                4              0
9     10     10                       5              1

1 answer:

Answer 0 (score: 0):

IIUC:

res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')

# Running count of removed values: number the missing rows 1, 2, 3, ...
# then forward-fill so every row carries the total removed so far
res["N_removedvalues"] = np.where(res['col_2'].isna(),
                                  res.groupby(res['col_2'].isna()).cumcount().add(1),
                                  np.nan)
res["N_removedvalues"] = res["N_removedvalues"].ffill().fillna(0)

# Flag the first row of each run of consecutive missing values
res['N_consecutive'] = np.logical_and(res['col_2'].isna(),
                                      np.logical_or(~res['col_2'].shift().isna(),
                                                    res.index == res.index[0]))

# Rows that continue a run become NaN so they can inherit the run's label
res.loc[np.logical_and(res['N_consecutive'] == 0, res['col_2'].isna()), 'N_consecutive'] = np.nan

# The cumulative sum of the start flags gives each run a distinct label
# (present rows stay 0); forward-fill spreads the label across the run
res['N_consecutive'] = res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()

# Finally, number the rows within each run 1, 2, 3, ...
res.loc[res['N_consecutive'].gt(0), 'N_consecutive'] = (
    res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1)
)

Output:

   col_1  col_2_1  col_2  N_removedvalues  N_consecutive
0      1        1    1.0              0.0            0.0
1      2        2    2.0              0.0            0.0
2      3        3    NaN              1.0            1.0
3      4        4    4.0              1.0            0.0
4      5        5    NaN              2.0            1.0
5      6        6    NaN              3.0            2.0
6      7        7    7.0              3.0            0.0
7      8        8    8.0              3.0            0.0
8      9        9    NaN              4.0            1.0
9     10       10    NaN              5.0            2.0
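
For what it's worth, the same two columns can also be derived more directly from the missing-value mask. This is a minimal alternative sketch (not the answer above), assuming the same df, df_subset, and merge; missing and run_id are helper names introduced here:

import numpy as np

# Align the subset against the full frame, as in the answer above
res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')
missing = res['col_2'].isna()

# Running total of removed values up to and including each row
res['N_removedvalues'] = missing.cumsum()

# A run starts wherever a missing row follows a present one;
# the cumulative sum of those start flags labels each run distinctly
run_id = (missing & ~missing.shift(fill_value=False)).cumsum()

# Number missing rows within their run, leaving present rows at 0
res['N_consecutive'] = np.where(missing, missing.groupby(run_id).cumcount() + 1, 0)

On the example data this reproduces the N_removedvalues and N_consecutive columns shown in the output above, with N_consecutive resetting to 0 at every row that survived the drop.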