Question

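My DataFrame looks like the one below. It contains runs of one or more consecutive rows in which y_l is filled and y_h is NaN, or vice versa. When a run between NaN rows contains more than one filled row, we want to keep only the row with the lowest y_l or the highest y_h. For example, in the df below, of the three consecutive rows with y_l filled (rows 1 to 3) we keep only row 2 and discard the other two. What is a smart way to achieve this?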

import numpy as np
import pandas as pd

df = pd.DataFrame({'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['y_l', 'y_h'])

>>> df

    y_l   y_h
0   NaN  90.0
1  97.0   NaN
2  95.0   NaN
3  98.0   NaN
4   NaN  95.0

Expected result:

    y_l   y_h
0   NaN  90.0
1  95.0   NaN
2   NaN  95.0

Answer 1

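You need to create a new column or Series that labels each consecutive run, then use groupby with agg to aggregate each run, and finally use reindex to restore the column order: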

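# True where y_l is missing; consecutive equal flags form one run
a = df['y_l'].isnull()
# Increment a counter each time the flag changes, giving one label per run
b = a.ne(a.shift()).cumsum()
# Per run, keep the lowest y_l and the highest y_h
df = (df.groupby(b, as_index=False)
        .agg({'y_l': 'min', 'y_h': 'max'})
        .reindex(columns=['y_l', 'y_h']))
print(df)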
    y_l   y_h
0   NaN  90.0
1  95.0   NaN
2   NaN  95.0

Details:

print (b)
0    1
1    2
2    2
3    2
4    3
Name: y_l, dtype: int32
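
To make the run labels above easier to follow, here is a minimal sketch that splits the one-line expression for b into its intermediate steps. The names changed and steps are introduced only for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y_l": [np.nan, 97, 95, 98, np.nan],
                   "y_h": [90, np.nan, np.nan, np.nan, 95]},
                  columns=["y_l", "y_h"])

a = df["y_l"].isna()        # True where y_l is missing
changed = a.ne(a.shift())   # True on the first row of each new run
b = changed.cumsum()        # running count of run starts, used as the run label

# Side-by-side view of the three steps (illustrative only)
steps = pd.DataFrame({"a": a, "changed": changed, "b": b})
print(steps)
```

The first row always counts as a run start because a.shift() has no previous value to compare against, so the labels begin at 1.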

Answer 2

What if you have more columns? For example:

df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan], 'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['A', 'y_l', 'y_h'])
>>> df

     A      y_l     y_h
0   NaN     NaN     90.0
1   15.0    97.0    NaN
2   20.0    95.0    NaN
3   25.0    98.0    NaN
4   NaN     NaN     95.0

How can the values in column A be kept after filtering out the irrelevant rows?

     A      y_l     y_h
0   NaN     NaN     90.0
1   20.0    95.0    NaN
2   NaN     NaN     95.0
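
One possible approach, sketched here rather than taken from the answers on this page, is to select whole rows instead of aggregating each column separately. The sketch reuses the run labels from Answer 1 and picks one index label per run with idxmin or idxmax, so every other column (such as A) comes along with the selected row. The helper pick_row and the names run_id, keep and result are hypothetical, and the code assumes every row has exactly one of y_l or y_h filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [np.nan, 15, 20, 25, np.nan],
        "y_l": [np.nan, 97, 95, 98, np.nan],
        "y_h": [90, np.nan, np.nan, np.nan, 95],
    },
    columns=["A", "y_l", "y_h"],
)

# Label each run of consecutive rows sharing the same "y_l is NaN" state
is_nan = df["y_l"].isna()
run_id = is_nan.ne(is_nan.shift()).cumsum()


def pick_row(group):
    """Hypothetical helper: index label of the row to keep within one run."""
    if group["y_l"].notna().any():
        return group["y_l"].idxmin()  # run with y_l filled: lowest y_l
    return group["y_h"].idxmax()      # run with y_h filled: highest y_h


# One index label per run, then select those complete rows
keep = [pick_row(group) for _, group in df.groupby(run_id)]
result = df.loc[keep].reset_index(drop=True)
print(result)
```

Because the rows are selected by label, the value of A in the kept row (20.0 for the middle run) is preserved without listing A in an aggregation.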

How do I filter runs of consecutive data rows between NaN rows in a pandas DataFrame?
