Find the longest stretch of null values and generate a flag

Time: 2019-05-15 11:12:28

Tags: python pandas missing-data

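I have a DataFrame with a datetime and two columns. For specific dates, I have to find the longest stretch of consecutive null values in column 'X' and replace it with zeros in both columns for that date. In addition, I have to create a third column named 'Flag', which holds 1 for every row where zeros were imputed in the other two columns and 0 otherwise. In the example below, the longest stretch of nulls on 1 January is 3 rows long, so I have to replace those rows with zeros. Likewise, I have to repeat the process for 2 January.

Here is my sample data: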

Datetime            X    Y
01-01-2018 00:00    1   1
01-01-2018 00:05    nan 2
01-01-2018 00:10    2   nan
01-01-2018 00:15    3   4
01-01-2018 00:20    2   2
01-01-2018 00:25    nan 1
01-01-2018 00:30    nan nan
01-01-2018 00:35    nan nan
01-01-2018 00:40    4   4
02-01-2018 00:00    nan nan
02-01-2018 00:05    2   3
02-01-2018 00:10    2   2
02-01-2018 00:15    2   5
02-01-2018 00:20    2   2
02-01-2018 00:25    nan nan
02-01-2018 00:30    nan 1
02-01-2018 00:35    3   nan
02-01-2018 00:40    nan nan

Below is my expected result:

Datetime           X    Y   Flag
01-01-2018 00:00    1   1   0
01-01-2018 00:05    nan 2   0
01-01-2018 00:10    2   nan 0
01-01-2018 00:15    3   4   0
01-01-2018 00:20    2   2   0
01-01-2018 00:25    0   0   1
01-01-2018 00:30    0   0   1
01-01-2018 00:35    0   0   1
01-01-2018 00:40    4   4   0
02-01-2018 00:00    nan nan 0
02-01-2018 00:05    2   3   0
02-01-2018 00:10    2   2   0
02-01-2018 00:15    2   5   0
02-01-2018 00:20    2   2   0
02-01-2018 00:25    nan nan 0
02-01-2018 00:30    nan 1   0
02-01-2018 00:35    3   nan 0
02-01-2018 00:40    nan nan 0

This question is an extension of my previous question: Python - Find maximum null values in stretch and replacing with 0

1 answer:

Answer 0 (score: 2)

First, label the consecutive null groups in each column with unique values:

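df1 = df.isna()
# a new label starts whenever the null state changes or a new date begins
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
# scale the Y labels so they do not clash with the X labels
df2['Y'] *= len(df2)
print (df2)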
                        X      Y
Datetime                        
2018-01-01 00:00:00   NaN    NaN
2018-01-01 00:05:00   2.0    NaN
2018-01-01 00:10:00   NaN   36.0
2018-01-01 00:15:00   NaN    NaN
2018-01-01 00:20:00   NaN    NaN
2018-01-01 00:25:00   4.0    NaN
2018-01-01 00:30:00   4.0   72.0
2018-01-01 00:35:00   4.0   72.0
2018-01-01 00:40:00   NaN    NaN
2018-02-01 00:00:00   6.0  108.0
2018-02-01 00:05:00   NaN    NaN
2018-02-01 00:10:00   NaN    NaN
2018-02-01 00:15:00   NaN    NaN
2018-02-01 00:20:00   NaN    NaN
2018-02-01 00:25:00   8.0  144.0
2018-02-01 00:30:00   8.0    NaN
2018-02-01 00:35:00   NaN  180.0
2018-02-01 00:40:00  10.0  180.0
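
The labelling idiom above may be easier to follow on a single column. The following is a minimal sketch, assuming a toy Series `s` that is not part of the original data: comparing the null mask with its shifted copy marks where each run starts, and the cumulative sum turns those marks into run labels.

```python
import numpy as np
import pandas as pd

# Toy column, assumed for illustration only.
s = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, 3.0])

is_null = s.isna()
# True wherever the null/non-null state differs from the previous row.
# The first row always counts as a change because its shifted value is missing.
starts = is_null.ne(is_null.shift())
# A running count of the changes gives every consecutive run its own label.
run_id = starts.cumsum()
# Keep the labels only on null rows; non-null rows become NaN.
null_runs = run_id.where(is_null)
print(null_runs.tolist())
```

The two adjacent nulls share one label, while the isolated null gets a different one. In the answer the shift is done per date through `groupby`, so a run cannot continue across a day boundary.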

Then get the group with the largest count; here it is group 4:

a = df2.stack().value_counts().index[0]
print (a)
4.0
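
To see why the most frequent label identifies the longest run, here is a small sketch on an assumed label frame `labels` (hypothetical values, shaped like `df2`). `stack()` flattens both columns into one Series, and `value_counts()` sorts the labels by how many rows carry each one, skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical run labels shaped like df2: NaN marks cells that are not null
# in the original data.
labels = pd.DataFrame({
    "X": [np.nan, 2.0, np.nan, 4.0, 4.0, 4.0],
    "Y": [np.nan, np.nan, 12.0, np.nan, 24.0, 24.0],
})

# One count per label across both columns, largest count first.
counts = labels.stack().value_counts()
longest_label = counts.index[0]   # label carried by the most rows
longest_length = counts.iloc[0]   # number of rows in that run
print(longest_label, longest_length)
```

Note that this looks at both columns together, so a longer run in `Y` would win over a shorter one in `X`. How ties between equally long runs are resolved is not addressed in the answer.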

Build a mask of the matching rows, set those rows to 0, and for the Flag column convert the mask from True/False to integer 1/0:

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)

print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
2018-02-01 00:40:00  NaN  NaN     0
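
For reference, the steps above can be collected into one runnable sketch. The function name `flag_longest_null_run` and the way the sample frame is built are my own assumptions, since the answer does not show how `df` was created. The dates are parsed day-first here to match the question's wording (1 and 2 January); the printed output above shows `2018-02-01`, which suggests month-first parsing, but the grouping by date gives the same rows either way.

```python
import numpy as np
import pandas as pd

nan = np.nan
times = ["00:00", "00:05", "00:10", "00:15", "00:20",
         "00:25", "00:30", "00:35", "00:40"]
stamps = [f"{day}-01-2018 {t}" for day in ("01", "02") for t in times]
df = pd.DataFrame(
    {
        "X": [1, nan, 2, 3, 2, nan, nan, nan, 4,
              nan, 2, 2, 2, 2, nan, nan, 3, nan],
        "Y": [1, 2, nan, 4, 2, 1, nan, nan, 4,
              nan, 3, 2, 5, 2, nan, 1, nan, nan],
    },
    # Assumption: day-first dates, as the question's text suggests.
    index=pd.to_datetime(stamps, format="%d-%m-%Y %H:%M").rename("Datetime"),
)


def flag_longest_null_run(frame):
    """Zero the rows of the longest null run and add a 0/1 Flag column.

    Hypothetical helper; assumes columns named X and Y as in the question.
    """
    out = frame.copy()
    is_null = out.isna()
    days = out.index.floor("D")
    # Label runs of nulls per column, restarting at every new date.
    run_id = is_null.ne(is_null.groupby(days).shift()).cumsum().where(is_null)
    # Scale the Y labels so they do not clash with the X labels.
    run_id["Y"] *= len(run_id)
    # The label carried by the most rows marks the longest run.
    longest = run_id.stack().value_counts().index[0]
    mask = run_id.eq(longest).any(axis=1)
    out.loc[mask, :] = 0
    out["Flag"] = mask.astype(int)
    return out


result = flag_longest_null_run(df)
print(result)
```

Working on a copy keeps the input frame unchanged, which differs from the answer's in-place assignment; that is a design choice of this sketch, not something the answer requires.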

Edit:

Added a new condition so that only the dates in a list are matched:

dates = df.index.floor('d')

filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]

df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)

print (df2)
                       X     Y
Datetime                      
2018-01-01 00:00:00  NaN   NaN
2018-01-01 00:05:00  2.0   NaN
2018-01-01 00:10:00  NaN  36.0
2018-01-01 00:15:00  NaN   NaN
2018-01-01 00:20:00  NaN   NaN
2018-01-01 00:25:00  4.0   NaN
2018-01-01 00:30:00  4.0  72.0
2018-01-01 00:35:00  4.0  72.0
2018-01-01 00:40:00  NaN   NaN
2018-02-01 00:00:00  NaN   NaN
2018-02-01 00:05:00  NaN   NaN
2018-02-01 00:10:00  NaN   NaN
2018-02-01 00:15:00  NaN   NaN
2018-02-01 00:20:00  NaN   NaN
2018-02-01 00:25:00  NaN   NaN
2018-02-01 00:30:00  NaN   NaN
2018-02-01 00:35:00  NaN   NaN
2018-02-01 00:40:00  NaN   NaN
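
The date filter relies on broadcasting a per-row boolean array across both columns. Below is a minimal sketch of that step on an assumed three-row frame `toy` (not the original data). I pass parsed timestamps to `isin` rather than strings; that is my choice for the sketch, while the answer passes strings.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 00:05", "2018-02-01 00:00"])
toy = pd.DataFrame(
    {"X": [np.nan, 1.0, np.nan], "Y": [np.nan, np.nan, 2.0]},
    index=idx,
)

dates = toy.index.floor("D")
# One boolean per row: True when the row's date is in the filter list.
m = dates.isin(pd.to_datetime(["2018-01-01"]))
# m[:, None] has shape (3, 1), so it is applied to every column of the mask.
is_null = toy.isna() & m[:, None]
print(is_null)
```

The null in `X` on the last row is dropped from the mask because its date is not in the filter, which is why every row of the second date shows NaN in the `df2` output above.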

a = df2.stack().value_counts().index[0]
# alternative that also works when the filtered rows contain no NaNs (avoids IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)

print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
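
The commented-out fallback in the edited code can be checked in isolation. This is a small sketch, assuming a hypothetical label frame `empty_labels` in which no row has a null run, which is the situation where `index[0]` would fail.

```python
import numpy as np
import pandas as pd

# Hypothetical label frame with no null runs at all: every label is NaN.
empty_labels = pd.DataFrame({"X": [np.nan, np.nan], "Y": [np.nan, np.nan]})
counts = empty_labels.stack().value_counts()

# counts is empty, so counts.index[0] would raise IndexError.
# next() with a default returns -1 instead.
a = next(iter(counts.index), -1)

# Run labels are positive, so nothing equals -1 and the mask is all False.
mask = empty_labels.eq(a).any(axis=1)
print(a, mask.tolist())
```

With an all-False mask, the later assignment leaves the data untouched and the Flag column is 0 everywhere.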